I’ve been talking with Rufus Pollock about moving the Open Correspondence web site, as we’ve had the occasional snafu bringing the site back up after maintenance. I’m pleased to say that we managed the move last night and the site is back up, DNS moved and so on. The one thing that really surprised me is that the original site was running the project on the SQLite database engine that ships natively with Pylons. I use MySQL on both Linux and Windows, and I believe it has been tested (or currently runs) on PostgreSQL, though I might have misunderstood that conversation. That may explain why the original endpoint could be flaky and disappear.
I’ve been working on two things for the new version: a timeline and full-text search. Xapian has now been installed as the search engine, but the move has meant that I need to make one or two changes, so it is not live yet. I hope to have the issue resolved shortly as it is not a huge one.
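The real search will run on Xapian, but the core idea is just an inverted index over the letters. Here is a minimal pure-Python sketch of that idea — the letter texts and ids are made up for illustration, and this is a toy stand-in, not the Xapian integration itself:

```python
import re
from collections import defaultdict

def build_index(letters):
    """Map each lower-cased word to the set of letter ids containing it."""
    index = defaultdict(set)
    for letter_id, text in letters.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(letter_id)
    return index

def search(index, query):
    """Return ids of letters containing every word in the query (AND search)."""
    words = re.findall(r"[a-z']+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Invented sample data for the sketch.
letters = {
    "letter-1": "My dear Forster, I write to you from Broadstairs.",
    "letter-2": "Dear Miss Coutts, the school at Broadstairs thrives.",
}
index = build_index(letters)
print(sorted(search(index, "dear broadstairs")))  # ['letter-1', 'letter-2']
```

Xapian adds stemming, ranking and a persistent on-disk database on top of this, which is why it is worth the extra installation effort.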
There is also a nascent timeline of all of Dickens’s letters, which still has some issues, like taking its time to load as there are around 1000 items. I’ve a feeling that Rufus might be looking at this when he has a moment.
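One likely fix for the slow load is not to ship all ~1000 items at once but to bucket them, say by year, and load a bucket at a time. A small sketch of that approach, with an assumed record shape (ISO dates and made-up ids):

```python
from collections import defaultdict

def bucket_by_year(letters):
    """Group letter records by year so the timeline can fetch one year
    at a time instead of all ~1000 items in a single payload."""
    buckets = defaultdict(list)
    for letter in letters:
        year = letter["date"][:4]  # assumes ISO dates like "1852-03-07"
        buckets[year].append(letter)
    return dict(buckets)

# Invented sample records for the sketch.
letters = [
    {"id": "l1", "date": "1852-03-07", "to": "John Forster"},
    {"id": "l2", "date": "1852-11-21", "to": "Angela Burdett-Coutts"},
    {"id": "l3", "date": "1853-01-02", "to": "Wilkie Collins"},
]
buckets = bucket_by_year(letters)
print(sorted(buckets))       # ['1852', '1853']
print(len(buckets["1852"]))  # 2
```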
Last week, I posted the next steps that I’d like to take with the site to the Open Knowledge Foundation help and open literature lists. The next thing is to start exploring geographical information and to expose that data.
The upshot is that it is time to revisit the parsing methods and really beef them up. They sufficed to get the project ported from the original PHP and the site up, but now I think I have to go through each method and write more unit tests (and fold them into the openletters tests, as they are currently separate). As the project gets bigger, the value of unit tests in ensuring that we have not broken anything becomes much more apparent, and far outweighs the time taken to set them up. It is a habit that I need to force myself to keep up while developing; I already do this for some of the systems that I’m building at work using PHPUnit.
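As a sketch of what those tests might look like, here is a unittest case for a hypothetical salutation parser — the function name, regex and sample inputs are all my own invention, not the actual openletters code:

```python
import re
import unittest

def parse_salutation(text):
    """Hypothetical parser: pull the addressee out of a letter's
    salutation line, e.g. 'My dear Forster,' -> 'Forster'."""
    match = re.match(r"(?:My )?[Dd]ear ([\w .'-]+?)[,.]", text.strip())
    return match.group(1) if match else None

class TestParseSalutation(unittest.TestCase):
    def test_plain_salutation(self):
        self.assertEqual(parse_salutation("Dear Forster,"), "Forster")

    def test_my_dear_form(self):
        self.assertEqual(parse_salutation("My dear Miss Coutts,"), "Miss Coutts")

    def test_no_salutation(self):
        self.assertIsNone(parse_salutation("Gad's Hill, Tuesday"))

# run with: python -m unittest <module name>
```

Each parsing method gets a case like this, so a regression in the parser shows up as a failing test rather than as a broken endpoint.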
The second point that follows from this is how we store the data. Currently the site reparses the data for the endpoints, but we’ve been talking about using CouchDB, and possibly GeoCouch, for the next version. The idea is that we can store the data once and then transform it into the correct format when requested.
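The store-once, transform-on-request idea looks roughly like this. A plain dict stands in for a CouchDB document here, and the field names are an assumed schema, not the project’s actual one:

```python
import json
import xml.etree.ElementTree as ET

# A stored letter document, as it might look in CouchDB (assumed schema).
stored = {
    "_id": "letter-1024",
    "sender": "Charles Dickens",
    "recipient": "John Forster",
    "date": "1852-03-07",
    "text": "My dear Forster, ...",
}

def as_json(doc):
    """Serve the stored document as JSON, dropping CouchDB-style _ fields."""
    return json.dumps({k: v for k, v in doc.items() if not k.startswith("_")})

def as_xml(doc):
    """Transform the same stored document into a simple XML representation."""
    root = ET.Element("letter", id=doc["_id"])
    for field in ("sender", "recipient", "date", "text"):
        ET.SubElement(root, field).text = doc[field]
    return ET.tostring(root, encoding="unicode")

print(as_json(stored))
print(as_xml(stored))
```

The point is that the parse happens once, at import time; each endpoint then becomes a cheap view over the same stored document instead of a fresh reparse.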
In part I’ve decided the only way of finding out how the site might be used is to, well, dogfood it. So I’ve also started writing a Java client using Jena to run SPARQL queries, retrieve lists from the RDF endpoint and represent them in XML, JSON or HTML. Currently the SPARQL query is built (though it needs changes after last night’s move, as the RDF endpoint moved) and I still need to complete the conversion from a List&lt;QuerySolution&gt; into a readable form like XML and so on. The idea is that it will come as a JAR which can be put on the classpath of a WAR or another system.
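The conversion step is essentially “rows of variable bindings in, serialised document out”. For consistency with the other snippets here, a Python sketch of the same transformation — the rows below are invented stand-ins for Jena’s List&lt;QuerySolution&gt;, each mapping a SPARQL variable name to its bound value:

```python
import json
import xml.etree.ElementTree as ET

# Invented rows standing in for Jena's List<QuerySolution>:
# each row maps a SPARQL variable name to its bound value.
rows = [
    {"letter": "http://example.org/letter/1", "recipient": "John Forster"},
    {"letter": "http://example.org/letter/2", "recipient": "Wilkie Collins"},
]

def rows_to_json(rows):
    """Serialise the result rows as a JSON document."""
    return json.dumps({"results": rows})

def rows_to_xml(rows):
    """Serialise the result rows as <results><result>...</result></results>."""
    root = ET.Element("results")
    for row in rows:
        result = ET.SubElement(root, "result")
        for var, value in row.items():
            ET.SubElement(result, var).text = value
    return ET.tostring(root, encoding="unicode")

print(rows_to_json(rows))
print(rows_to_xml(rows))
```

In the Java client the same loop walks the QuerySolution list and asks each solution for its bound variables; the output formats are interchangeable because the row structure is the single source of truth.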
I’ve also got a Python script on the go to cluster the letters, which I will commit to the Python Bitbucket repo once I’ve got a bit more done on it; it currently only builds the initial matrix, so visualising the data is the next step.
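For flavour, the “initial matrix” stage amounts to a term–document matrix over the letters, after which a similarity measure such as cosine distance drives the clustering. A minimal sketch with invented sample texts — this is an illustration of the technique, not the actual script:

```python
import math
import re

def term_document_matrix(letters):
    """Build a {letter_id: {word: count}} term-document matrix."""
    matrix = {}
    for letter_id, text in letters.items():
        counts = {}
        for word in re.findall(r"[a-z']+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
        matrix[letter_id] = counts
    return matrix

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented sample letters for the sketch.
letters = {
    "l1": "the school at Broadstairs",
    "l2": "the school thrives at Broadstairs",
    "l3": "copyright and the American press",
}
m = term_document_matrix(letters)
# l1 and l2 share most of their vocabulary, so they should sit closer
# together than l1 and l3 do.
print(cosine(m["l1"], m["l2"]) > cosine(m["l1"], m["l3"]))  # True
```

A clustering pass (k-means, hierarchical, or similar) over these similarity scores is then what the visualisation step would draw.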