Searching Open Correspondence with Xapian

As part of the continuing work on Open Correspondence, I managed to install Xapian to act as a full text search engine. I’ve been looking to do this for a while and had started working on a remote back end (as blogged here), but decided not to use it as it appears to lack security when used across different machines on the web. I suppose you could place it behind a web service and expose it that way if you want to create a secure remote back end.

The search is rather basic: a simple form to enter a phrase or words, with the results showing the text and the letter URL. On the list of things to do is an advanced form that lets the user filter the results further by date or rank them by relevance in the text.

From what I can see, there are things that I can do on top of the simple search to achieve this. It would be useful to cut the selection down by date, which could be parsed from the text, with anything outside the chosen range discarded. Another is making the searches less naive and trying to discover relevance in the results. For example, there may be somebody called Nickleby in the letters who has nothing to do with the novel ‘Nicholas Nickleby’.
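As a rough sketch of the date idea, assuming the search hands back (url, text) pairs and that a letter's date appears in its text in a form like "3rd March 1841", something along these lines could discard results outside a chosen range. The function names, the date pattern and the sample results are all made up for illustration; none of this is the Open Correspondence code or part of Xapian.

```python
import re
from datetime import date

# Month names mapped to numbers. The date format is an assumption about
# how dates appear in the letter text.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches dates such as "3 March 1841" or "3rd March, 1841".
DATE_RE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(" + "|".join(MONTHS) + r"),?\s+(\d{4})\b"
)


def parse_letter_date(text):
    """Return the first date found in the text, or None if there is none."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS[month], int(day))
    except ValueError:
        # Matched the pattern but is not a real date, e.g. 31 February.
        return None


def filter_by_date(results, start, end):
    """Keep only (url, text) results whose parsed date is within start..end.

    Results with no recognisable date are discarded.
    """
    kept = []
    for url, text in results:
        letter_date = parse_letter_date(text)
        if letter_date is not None and start <= letter_date <= end:
            kept.append((url, text))
    return kept


# Placeholder results, not real letters or real URLs.
sample_results = [
    ("/letters/1", "3rd March 1841. Placeholder text of a letter."),
    ("/letters/2", "12 June 1855. Another placeholder letter."),
    ("/letters/3", "A placeholder letter with no date in it."),
]
filtered = filter_by_date(sample_results, date(1840, 1, 1), date(1849, 12, 31))
```

Dropping undated letters keeps the filter simple, but it may hide relevant results, so an advanced form might want to show them separately instead.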

Simply put, there is a fair amount of data munging that needs to go on next. That’s fine.

The next step that I’m working on is the use of OFS to run some of the endpoints and XML streams that are used for internal purposes, such as locations or the RDF endpoint. I’m hoping to use it to bring the Linked Data through into the letters themselves. I’m looking at using these mainly for performance reasons. Along with a hack on the places that I’m hoping to do next week, the main body of Open Correspondence will be done.

Next up is better data munging and information extraction, such as rewriting the parser and adding more letters to the database. Essentially, I’d like to provide better data in accessible formats for the letters and perhaps offer some tools to kickstart development.

I’m going to the Research Databases in the Humanities workshop to see what else we can do with the data and the site.