
Performing Optical Character Recognition in Java

Posted: Monday, June 8th, 2015 at 10:39 pm
Updated: Monday, June 8th, 2015 at 10:39 pm

Overview

There are several engines for performing optical character recognition (OCR) in Java. This Stack Overflow question addresses exactly that.

Personally, I’ve tested ABBYY and Tesseract. Although I found that ABBYY gives better results, I don’t think it is worth the price tag, so I’ve chosen to go with Tesseract.

Java bindings for Tesseract

There are several libraries that enable Java programmers to access the Tesseract C API.

  • Tess4J. In the words of the author, it is “A Java JNA wrapper for Tesseract OCR API.”
  • jtesseract. It’s a Java library for Tesseract generated by jnaerator. Essentially, it’s like you’re interacting with C, but in Java.
  • JavaCPP Presets for Tesseract. I like their idea of making native libraries accessible from Java.
  • Command line interface using Runtime.exec() and calling Tesseract’s program directly.

I really dig the idea of JavaCPP. It was the first solution I tried. jtesseract seems too complex. I’m not really interested in getting into the detail. All I want is to do OCR without really caring about the details and complexity of the library.

Of course, the simplest option would be to call Tesseract’s executable using Runtime.exec(). However, that won’t work well for me, since I’ll be manipulating the images first. Calling a command-line program means I would have to save the manipulated images to disk before running OCR on them, which feels too cumbersome.
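To make the trade-off concrete, here is a minimal sketch of the command-line route, assuming the tesseract executable is on the PATH. It uses ProcessBuilder rather than Runtime.exec() (same idea, easier argument handling), and the names TesseractCli, buildCommand and ocr are made up for illustration. Note the round trip through temporary files that an in-memory binding avoids.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of any library mentioned in this post.
class TesseractCli {

    // Basic CLI form: tesseract <image> <outputbase>
    // Tesseract writes the recognized text to <outputbase>.txt.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Assumption: a "tesseract" executable is installed and on the PATH.
    static String ocr(BufferedImage image) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path input = dir.resolve("input.png");
        Path outputBase = dir.resolve("result");

        // The manipulated image has to be written to disk first.
        ImageIO.write(image, "png", input.toFile());

        Process process = new ProcessBuilder(buildCommand(input, outputBase))
                .inheritIO() // forward tesseract's own console output
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Then the result has to be read back from disk.
        byte[] bytes = Files.readAllBytes(dir.resolve("result.txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Calling TesseractCli.ocr(image) would then return the recognized text, at the cost of one temporary image file and one temporary text file per call.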

So after looking at the options above, I’ve chosen to use the simple library Tess4J.

Configuring Tess4J

The developer website doesn’t mention it, but there’s a Maven repository entry for it here. All you have to do is add the following dependency to your pom.xml.

<dependency>
	<groupId>net.sourceforge.tess4j</groupId>
	<artifactId>tess4j</artifactId>
	<version>2.0.0</version>
</dependency>

However, I’m using a MacBook Pro as my development platform. Like other programmers who have tried to use Tess4J on a Mac, I ran into this error:

Unable to load library ‘tesseract’: Native library (darwin/libtesseract.dylib)

What I found is that the JAR file in the Maven repo (tess4j-2.0.0.jar) only includes native libraries for Windows. Here’s how I found out:

user@laptop:~$ jar tf /Users/user/.m2/repository/net/sourceforge/\
tess4j/tess4j/2.0.0/tess4j-2.0.0.jar
... (redacted)
win32-x86/gsdll32.dll
win32-x86/liblept170.dll
win32-x86/libtesseract303.dll
win32-x86-64/gsdll64.dll
win32-x86-64/liblept170.dll
win32-x86-64/libtesseract303.dll
... (redacted)

Since I had installed Tesseract through MacPorts, as recommended, I already had libtesseract.dylib in /opt/local/lib. So I figured I could add a darwin directory containing libtesseract.dylib to tess4j-2.0.0.jar.

This is what I ended up doing (run it from the directory in your local Maven repository cache that holds the JAR).

user@laptop:~$ mkdir darwin
user@laptop:~$ jar uf tess4j-2.0.0.jar darwin
user@laptop:~$ cp /opt/local/lib/libtesseract.3.dylib darwin/libtesseract.dylib
user@laptop:~$ jar uf tess4j-2.0.0.jar darwin/libtesseract.dylib
user@laptop:~$ jar tf tess4j-2.0.0.jar
... (redacted)
win32-x86/gsdll32.dll
win32-x86/liblept170.dll
win32-x86/libtesseract303.dll
win32-x86-64/gsdll64.dll
win32-x86-64/liblept170.dll
win32-x86-64/libtesseract303.dll
darwin/
darwin/libtesseract.dylib
... (redacted)

Tess4J is now able to load the native Tesseract library. I imagine a similar technique would work on Linux or FreeBSD as well. Keep in mind that this is a hack: it modifies the tess4j-2.0.0.jar in your local Maven cache, so you will have to redo the steps if the cache is cleared or the JAR file is updated.
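A less invasive alternative, which a commenter on this post points out, is to leave the JAR alone and tell JNA where to look through the jna.library.path system property. Below is a minimal sketch of that idea; the class and method names are hypothetical, /opt/local/lib is simply where MacPorts put the library on my machine, and I am assuming the property is set before the first Tess4J call triggers loading of the native library.

```java
import java.io.File;

// Hypothetical helper: points JNA at a directory containing native libraries.
class NativeLibraryPath {

    // Prepends the directory to jna.library.path, keeping any existing value.
    static void addJnaLibraryPath(String directory) {
        String existing = System.getProperty("jna.library.path");
        String value = (existing == null || existing.isEmpty())
                ? directory
                : directory + File.pathSeparator + existing;
        System.setProperty("jna.library.path", value);
    }

    public static void main(String[] args) {
        // Assumption: libtesseract.dylib lives here (the MacPorts default prefix).
        addJnaLibraryPath("/opt/local/lib");
        System.out.println(System.getProperty("jna.library.path"));
    }
}
```

The same property can also be passed on the command line, as in java -Djna.library.path=/opt/local/lib, which avoids touching the code at all.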

Using Tess4J in your Java code

What drew me to Tess4J is that I already have a BufferedImage from processing the file beforehand. Running OCR with Tess4J is then as simple as writing two lines of code:

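import java.awt.image.BufferedImage;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

BufferedImage imageToBeOCRed = ...; // Processing to get the BufferedImage

Tesseract ocr = new Tesseract();
try {
	// doOCR takes the in-memory image directly, so nothing is written to disk
	String text = ocr.doOCR(imageToBeOCRed);
	System.out.println("OCRed text: " + text);
} catch (TesseractException e) {
	// Recognition failed; print the stack trace for diagnosis
	e.printStackTrace();
}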
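The processing step is elided in the snippet above, and it will differ from project to project. Purely as an illustrative stand-in (not the processing I actually use), here is a sketch of one common kind of preprocessing, grayscale conversion, written with only the standard library. The class name GrayscalePreprocess and the choice of the usual 0.299/0.587/0.114 luminance weights are assumptions made for the example.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical preprocessing step producing a BufferedImage that can be handed to doOCR.
class GrayscalePreprocess {

    // Converts any BufferedImage to an 8-bit grayscale image of the same size.
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = gray.getRaster();

        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted sum of the channels, clamped to the 0..255 range
                long luminance = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                raster.setSample(x, y, 0, (int) Math.min(255, luminance));
            }
        }
        return gray;
    }
}
```

The image returned by GrayscalePreprocess.toGrayscale(...) stays in memory, so it can be passed straight to ocr.doOCR(...) without a temporary file.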

So there you have it. As always, I welcome comments, questions, and criticism that will help me and other readers understand better.


One Response to “Performing Optical Character Recognition in Java”

  1. Quan Nguyen Says:

    You need not hack tess4j.jar unless you plan to redistribute the modified package. You can set either the “jna.library.path” property or the DYLD_LIBRARY_PATH environment variable so that JNA can find the libtesseract.dylib object.

    See http://tess4j.sourceforge.net/tutorial/
