I’ve been doing some updating this week rather than anything new. I had planned to spend the time completing the places section of the Open Correspondence website, which needs some tidying up as the endpoint has had some changes made to it. Along the way I came across an issue which has implications for exposing other pieces of metadata, such as the people being referred to.
Firstly, I need to work out a more exact way of mapping the data in the database or flat file. I think what I really need is to use something like:
- place
- address
- city
- latitude
- longitude
- description
- url
The data that I have is not quite as granular as this. Yet. When I’ve done this, I need to build the mapping so that if a place is entered, say Hotel Meurice, Paris, then I can return the details and latitude/longitude to render an OpenLayers map. That’s almost the easiest bit really.
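As a rough sketch of what I have in mind (the field values here are placeholders I’ve made up, and the real data would sit in the database rather than an in-memory dictionary):

```python
# A minimal sketch of the place lookup. The record below is
# illustrative only; the address and coordinates are not real data.
PLACES = {
    "Hotel Meurice, Paris": {
        "place": "Hotel Meurice",
        "address": "Rue de Rivoli",      # placeholder
        "city": "Paris",
        "latitude": 48.8655,             # placeholder
        "longitude": 2.3280,             # placeholder
        "description": "Hotel mentioned in the letters",
        "url": "http://example.org/places/hotel-meurice",
    },
}

def lookup_place(name):
    """Return the place record, or None if the place is unknown."""
    return PLACES.get(name)

# The latitude/longitude pair can then be handed straight to the
# OpenLayers map on the client side.
record = lookup_place("Hotel Meurice, Paris")
if record is not None:
    print(record["latitude"], record["longitude"])
```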
The second issue is the difference in names. Over time and in the heat of writing, names can change subtly. Take Gads Hill Place, one of Dickens’s homes, which is now a school. In the letters it is referred to as:
- Gad’s Hill Place,
- Gad’s Hill Place, Higham
- Gad’s Hill
It can also be known as Gadshill Place or Gads Hill Place. I need to find a way of disambiguating the terms. Firstly, I need a way of checking a term and returning it if it is new, or returning the mapped version if it matches a known term. Secondly, I need to fuzzy match the strings so that any near matches (using the Levenshtein edit distance) can either be folded into the canonical term or excluded, something like the sketch below.
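A rough sketch of the matching, assuming a hand-maintained map of known variants to a canonical form; the threshold is a guess and would need tuning against the real data:

```python
# Known variants mapped to a canonical form.
KNOWN_PLACES = {
    "Gad's Hill Place": "Gad's Hill Place",
    "Gad's Hill Place, Higham": "Gad's Hill Place",
    "Gad's Hill": "Gad's Hill Place",
}

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def canonicalise(term, threshold=3):
    """Return the mapped form for a known term, a fuzzy match if the
    edit distance is small enough, or the term itself if it is new."""
    if term in KNOWN_PLACES:
        return KNOWN_PLACES[term]
    for variant, canonical in KNOWN_PLACES.items():
        if levenshtein(term, variant) <= threshold:
            return canonical
    return term  # a genuinely new term

print(canonicalise("Gads Hill Place"))  # -> "Gad's Hill Place"
```

The same approach should work for correspondent names, which have exactly the same variant problem.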
These issues will also affect the correspondent code which is being created. I suspect that anything with names will have the same issues. For instance, Wilkie Collins is known in the letters as Mr W Wilkie Collins and Mr Wilkie Collins. In the current implementation of the site, these are two different entities, which is clearly wrong. They are the same entity with a subtle difference that is not accounted for.
So to deal with this, I am going back to the parsing library and building these checks in instead. Whilst it is a slower way of dealing with these issues, it provides a chance to do any necessary re-thinking of the information and the site.
As part of this, I downloaded some TEI guidelines from the California Digital Library to use as the basis for the metadata export. Ideally what I’m hoping to do is to create the data as a Python dictionary and then reformat it into HTML, HTML & RDFa, RDF, JSON or XML. That should allow me to export the same data in each format.
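Something along these lines is the idea, assuming one flat dictionary per letter; the field names here are placeholders rather than the final TEI mapping, and the RDF/RDFa serialisers would follow the same pattern:

```python
import json
import xml.etree.ElementTree as ET

# Placeholder record; the real fields would follow the TEI guidelines.
letter = {
    "author": "Charles Dickens",
    "recipient": "Wilkie Collins",
    "place": "Gad's Hill Place",
    "date": "1860-09-04",
}

def to_json(record):
    """Serialise the dictionary as JSON."""
    return json.dumps(record, indent=2)

def to_xml(record, root_tag="letter"):
    """Serialise the dictionary as a flat XML document."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = value
    return ET.tostring(root, encoding="unicode")

print(to_json(letter))
print(to_xml(letter))
```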
I’m sure at times I’ll wonder what I’ve started, but it needs doing if the site is to accept more authors. After that, back to search.
On a separate note, I have also done some work on the Arts Funding search. I’ve given it a re-skin and used the Accordion widget from jQuery UI. It also has some more search options built in, so that the data can be searched by date and amount as well as political constituency and art form. The search needs to take in some arguments such as <, > or = for the amount, but that can come later, perhaps along the lines of the sketch below. I’ve been reading Jeni Tennison’s post on the data.gov.uk blog on how best to expose the data using Linked Data.
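For the amount argument, something like this might do, assuming the query arrives as a plain string such as "<5000"; the function name and the set of operators are just my guess at how it might be wired up:

```python
import re

# Optional comparison operator followed by a number; the two-character
# operators must be tried before the single-character ones.
AMOUNT_PATTERN = re.compile(r"^\s*(<=|>=|<|>|=)?\s*(\d+(?:\.\d+)?)\s*$")

def parse_amount(raw):
    """Return an (operator, value) pair, defaulting to equality."""
    match = AMOUNT_PATTERN.match(raw)
    if match is None:
        raise ValueError("unrecognised amount expression: %r" % raw)
    op, value = match.groups()
    return (op or "="), float(value)

print(parse_amount("<5000"))  # ('<', 5000.0)
print(parse_amount("250"))    # ('=', 250.0)
```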
Whilst writing this post, it occurs to me that whilst Linked Data is an awesome way of exposing data, useful search is still an important part of any content-driven website. As blogged before, I have implemented an early version of a Xapian search. As Tim Bray has noted, advanced search might have a smaller audience, but it is more likely to be used by the heavier users, so it deserves to have time spent on it.