OCR

For no really good reason, I found myself looking up OCR (“Optical Character Recognition”) components on the weekend. OCR is the technology that looks at a scanned image of a page and figures out the typewritten text (handwriting recognition is usually an even harder problem).

(Yes, this is the kind of thing I do on my weekends. Yes, I know I’m a total nerd.)

I didn’t find any open source Java implementations. There are a few commercial products, though:

But what really grabbed my attention was this: Google releases open source OCR. The package is called Tesseract, and it was open sourced last September.

So, since I was already being a complete nerd, I downloaded the package and pondered some ways of integrating it with Java: JNI? Rewrite it?
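A third route, besides JNI or a rewrite, would be to leave the C++ alone and launch the command-line program from Java as a child process. The sketch below is only an illustration under a few assumptions: a `tesseract` executable is on the PATH, it is invoked as `tesseract <image> <outputBase>`, and it writes its result to `<outputBase>.txt`. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are made up for this example, and which image formats are accepted depends on the Tesseract version installed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: none of these names come from Tesseract itself.
class TesseractCli {

    // Builds the command line: tesseract <image> <outputBase>
    // Assumes a "tesseract" executable is available on the PATH.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs Tesseract on one image file and returns the recognized text.
    static String recognize(Path image) throws IOException, InterruptedException {
        // Scratch directory for the output file (not cleaned up in this sketch).
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("page");

        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO() // pass console output through so the child never blocks on a full pipe
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Assumption: Tesseract appends ".txt" to the output base name.
        Path textFile = workDir.resolve("page.txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TesseractCli scanned-page.tif
        System.out.println(recognize(Paths.get(args[0])));
    }
}
```

The trade-off is the obvious one: a process launch per page is slower and clunkier than a JNI binding, but it needs no native glue code and no C++ at all.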

And somewhere around there, I found myself reading the source code and thinking: “Boy howdy, have I forgotten a lot of C++.”
