Using Tesseract with Python for OCR

Following several conversations with Alex Butterworth over pots of tea in the crypt of St Mary’s Church in Oxford, I’ve been having a look at Python and its bindings for the Tesseract OCR library.

A quick Google search brought me to this post by Roy on building an HTTP service using Tornado. I am fairly new to Tornado but have been looking at it for an experiment at work. However, I have been using Flask for other projects, such as a quick-and-dirty RDF browser largely based on Chris Gutteridge’s PHP browser. (At some point very soon I need to get back to this, but other projects are slightly more pressing at the moment.)

After a quick upgrade to version 0.8, I have managed to put together something a little like Roy’s script, but I’m hoping to go further and add a storage layer before tidying it all up.

Unlike Roy’s script, I’ve pushed the Tesseract and file handling code outside of the server. In the long term, I’d like to split the file handling and storage facilities out from the web server entirely, which means deciding what the storage should be. As a quick step, I’ve popped in a link to a MySQL database, but a far better option would probably be a NoSQL database like CouchDB or similar. I suppose a key/value store like Redis could be used as a back end as well (Redisfs apparently does something like this). I’m keeping my options open.
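To give a flavour of what keeping the OCR code outside the server can look like, here is a minimal sketch. It is not my actual script: it swaps the Python bindings for a plain call to the tesseract command-line tool (assumed to be installed and on the PATH, and recent enough to accept "stdout" as its output target), and the function names build_tesseract_command and ocr_file are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for the tesseract command-line tool.

    Assumption: "stdout" as the output base makes tesseract print the
    recognised text instead of writing a .txt file next to the image.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_file(image_path, lang="eng", runner=subprocess.run):
    """Run OCR on one image file and return the recognised text.

    `runner` is a hypothetical hook: it defaults to subprocess.run but can
    be replaced, so the function can be tried without tesseract installed.
    """
    image_path = Path(image_path)
    if not image_path.is_file():
        raise FileNotFoundError(image_path)
    result = runner(
        build_tesseract_command(image_path, lang),
        capture_output=True,  # collect stdout/stderr rather than printing
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

A web handler in Flask or Tornado would then only need to save the upload somewhere and call ocr_file on the saved path, which keeps the server itself free of any Tesseract details.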

I am tempted to use RabbitMQ to notify various workers that a file exists. If I’m hoping to use this as a book scanner back end (discussed at the Textcamp event in August), then I need to add an automated set of scripts that reads a directory and deals with moving, storing and scanning the files in it. Tornado might be a long-term answer, but realistically it is not needed for a test project.
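The "read a directory and deal with the files" step could start out as something as small as the sketch below. It is only a guess at the shape of it: the function name process_incoming, the two-directory layout and the handle_file callback are all hypothetical, and there is no RabbitMQ here, just a single pass over a directory that a cron job or a loop could call repeatedly.

```python
import shutil
from pathlib import Path


def process_incoming(incoming_dir, processed_dir, handle_file):
    """Run handle_file on each file in incoming_dir, then move it away.

    Files are handled in sorted filename order. A file is moved only after
    handle_file returns; if handle_file raises, the exception propagates
    and that file (and any after it) stays in incoming_dir for a retry.
    Returns the list of destination paths.
    """
    incoming = Path(incoming_dir)
    processed = Path(processed_dir)
    processed.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(incoming.iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        handle_file(path)  # e.g. OCR the scan and store the text
        destination = processed / path.name
        shutil.move(str(path), str(destination))
        moved.append(destination)
    return moved
```

Here handle_file would be whatever does the scanning and storing. Swapping the polling for a message queue later should only change who calls it, not what it does.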

Also, Tesseract will need some training, as I discovered this evening while playing with some newspaper text and seeing some of the results. That might be a problem, since one of the reasons I began this was to preserve old fanzines and newspaper articles which I’ve kept for research but which are now degrading.

Either way, this is a way of moving ahead with the book scanner conversation and building something small to scratch some itches.