Overview
There are several engines for performing optical character recognition (OCR) in Java. This Stack Overflow question is about precisely that.
Personally, I’ve tested ABBYY and Tesseract. Although I found that ABBYY gives better results, I don’t think it’s worth the price tag, so I’ve chosen to go with Tesseract.
Java bindings for Tesseract
There are several libraries that enable Java programmers to access the Tesseract C API.
- Tess4J. In the words of the author, it is “A Java JNA wrapper for Tesseract OCR API.”
- jtesseract. It’s a Java library for Tesseract generated by jnaerator. Essentially, it’s like you’re interacting with C, but in Java.
- JavaCPP Presets for Tesseract. I like their idea of making native libraries accessible from Java.
- Command line interface using Runtime.exec() and calling Tesseract’s program directly.
I really dig the idea of JavaCPP, and it was the first solution I tried. jtesseract seems too complex, and I’m not really interested in getting into that level of detail. All I want is to do OCR without having to care about the details and complexity of the library.
Of course, the simplest option would be to call Tesseract’s executable using Runtime.exec(). However, that won’t work well for me because I’ll be manipulating the images first. Calling a command-line program means I would have to save the manipulated images to disk before running OCR, which feels too cumbersome.
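To make the trade-off concrete, here is a rough sketch of what the command-line route could look like. It is not code from my project: the class and method names are made up for illustration, I use ProcessBuilder instead of Runtime.exec(), and it assumes a `tesseract` executable on the PATH that accepts `stdout` as its output target.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper; the names here are illustrative only.
class TesseractCli {

    // Builds the command line. "stdout" asks the tesseract program to print
    // the recognized text instead of writing it to an output file.
    static List<String> buildCommand(Path image) {
        return Arrays.asList("tesseract", image.toString(), "stdout");
    }

    // Writes the in-memory image to a temporary PNG. This is the extra
    // disk round trip that the command-line approach forces on us.
    static Path writeTempPng(BufferedImage image) throws IOException {
        Path tmp = Files.createTempFile("ocr-input-", ".png");
        ImageIO.write(image, "png", tmp.toFile());
        return tmp;
    }

    // Saves the image, runs tesseract on it, and returns whatever it printed.
    static String ocr(BufferedImage image) throws IOException, InterruptedException {
        Path tmp = writeTempPng(image);
        try {
            Process process = new ProcessBuilder(buildCommand(tmp))
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            byte[] output = process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(output, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Every call pays for a temporary file and a new process, which is exactly the overhead I wanted to avoid when the image is already in memory.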
So after looking at the options above, I’ve chosen to use the simple library Tess4J.
Configuring Tess4J
The developer website doesn’t mention it, but there’s a Maven repository entry for it here. All you have to do is add the following dependency to your pom.xml.
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>2.0.0</version>
</dependency>
However, I’m using a MacBook Pro as my development platform. Like other programmers who have tried to use Tess4J on a Mac, I ran into the following error:
Unable to load library ‘tesseract’: Native library (darwin/libtesseract.dylib)
What I found is that the JAR file in the Maven repo (tess4j-2.0.0.jar) only includes native libraries for Windows. Here’s how I found out:
user@laptop:~$ jar tf /Users/user/.m2/repository/net/sourceforge/tess4j/tess4j/2.0.0/tess4j-2.0.0.jar
... (redacted)
win32-x86/gsdll32.dll
win32-x86/liblept170.dll
win32-x86/libtesseract303.dll
win32-x86-64/gsdll64.dll
win32-x86-64/liblept170.dll
win32-x86-64/libtesseract303.dll
... (redacted)
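If you would rather run the same check from Java, something along these lines should do it. This is only a sketch: the class name is made up, and treating `.dll`, `.dylib`, and `.so` as "native library" is my own assumption, not something Tess4J defines.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper that lists the native libraries bundled in a JAR.
class NativeLibLister {

    // Assumption: a native library is any entry ending in .dll, .dylib, or .so.
    static boolean isNativeLibrary(String entryName) {
        return entryName.endsWith(".dll")
                || entryName.endsWith(".dylib")
                || entryName.endsWith(".so");
    }

    // Returns the names of all entries in the JAR that look like native libraries.
    static List<String> nativeEntries(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .map(JarEntry::getName)
                    .filter(NativeLibLister::isNativeLibrary)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to the JAR as the first argument.
        for (String name : nativeEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Run it with the path to tess4j-2.0.0.jar in your local Maven cache as the only argument.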
Since I installed Tesseract through MacPorts, based on their recommendation, I already have libtesseract.dylib in /opt/local/lib. So I figured I should add a darwin directory containing libtesseract.dylib to tess4j-2.0.0.jar.
So I ended up doing this (adjust your directory to match your local maven repository cache).
user@laptop:~$ mkdir darwin
user@laptop:~$ jar uf tess4j-2.0.0.jar darwin
user@laptop:~$ cp /opt/local/lib/libtesseract.3.dylib darwin/libtesseract.dylib
user@laptop:~$ jar uf tess4j-2.0.0.jar darwin/libtesseract.dylib
user@laptop:~$ jar tf tess4j-2.0.0.jar
... (redacted)
win32-x86/gsdll32.dll
win32-x86/liblept170.dll
win32-x86/libtesseract303.dll
win32-x86-64/gsdll64.dll
win32-x86-64/liblept170.dll
win32-x86-64/libtesseract303.dll
darwin/
darwin/libtesseract.dylib
... (redacted)
Tess4J is now able to load the native Tesseract library. I imagine a similar technique can be applied on Linux or FreeBSD as well. However, keep in mind that this is a hack, since it modifies your locally cached tess4j-2.0.0.jar. You will need to redo these steps if the cache is cleared or the JAR file is updated.
Using Tess4J in your Java code
One thing that drew me to Tess4J is that I already have a BufferedImage from processing the file beforehand. Doing OCR with Tess4J is then as simple as writing two lines of code:
import java.awt.image.BufferedImage;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

BufferedImage imageToBeOCRed = ...; // Processing to get the BufferedImage
Tesseract ocr = new Tesseract();
try {
    // Run OCR directly on the in-memory image; no temporary file is needed
    String text = ocr.doOCR(imageToBeOCRed);
    System.out.println("OCRed text: " + text);
} catch (TesseractException e) {
    // Thrown when Tesseract fails to process the image
    e.printStackTrace();
}
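For context, here is one way a processed BufferedImage might come about. This is purely an illustrative sketch: the grayscale conversion is an assumed example of "processing", not necessarily the manipulation I actually do, and the class name is hypothetical. It only uses the JDK.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical preprocessing step; grayscale conversion is just an example.
class OcrPreprocessing {

    // Returns a grayscale copy of the source image with the same dimensions.
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D graphics = gray.createGraphics();
        try {
            // Drawing onto a TYPE_BYTE_GRAY image converts the colors for us.
            graphics.drawImage(source, 0, 0, null);
        } finally {
            graphics.dispose();
        }
        return gray;
    }
}
```

The returned image can be handed straight to doOCR(), with no trip through the file system.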
So there you have it. As always, I welcome comments, questions, and critiques that will help me and other readers understand better.
You don’t need to hack tess4j.jar unless you plan to redistribute the modified package. You can set either the “jna.library.path” system property or the DYLD_LIBRARY_PATH environment variable so that JNA can find libtesseract.dylib.
See http://tess4j.sourceforge.net/tutorial/
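The suggestion above could be sketched like this. It is a minimal, JDK-only example under a few assumptions: the class and method names are made up, /opt/local/lib is simply where MacPorts put libtesseract.dylib in the post, and the property has to be set before Tess4J first loads the native library.

```java
import java.io.File;

// Hypothetical helper for pointing JNA at a directory of native libraries.
class JnaPathSetup {

    static final String JNA_LIBRARY_PATH = "jna.library.path";

    // Appends dir to jna.library.path, keeping any value that is already set,
    // and returns the resulting property value.
    static String addToJnaLibraryPath(String dir) {
        String current = System.getProperty(JNA_LIBRARY_PATH);
        String updated = (current == null || current.isEmpty())
                ? dir
                : current + File.pathSeparator + dir;
        System.setProperty(JNA_LIBRARY_PATH, updated);
        return updated;
    }

    public static void main(String[] args) {
        // Assumption: libtesseract.dylib lives in /opt/local/lib (MacPorts).
        addToJnaLibraryPath("/opt/local/lib");
        // Create the Tesseract instance and call doOCR() only after this point.
        System.out.println(System.getProperty(JNA_LIBRARY_PATH));
    }
}
```

The same effect is available without code by starting the JVM with -Djna.library.path=/opt/local/lib, which leaves the cached JAR untouched.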