Archive for the ‘Information Retrieval’ Category

Bibliographica – open bibliographic sourcing and maintenance

Sunday, January 24th, 2010

Jonathan Gray of the Open Knowledge Foundation has a thought-provoking post on the need for an Open Bibliographic Service, which he calls Bibliographica. As he writes:

lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing. Books and articles are published and translated all the time. Works fall in and out of fashion. ‘Secondary’ reference works can become obsolete – considered interesting more for what they say about a particular intellectual period than what they say about their subject matter.

I’ve been working on my own book as an independent researcher and wanted to know the common books and articles in the area. As a user I wanted to know what was published in a particular area and what the points of commonality are, in order to identify key works. Jonathan’s idea would be a help for this and, perhaps more importantly, provide a shared platform.

As he identifies, sites like Amazon and LibraryThing allow the user to create lists of books, but over time fashions change and books fall into and out of favour. Being able to compile searchable, sortable lists would allow the user to develop comprehensive lists (and allow the intellectual historian to trace the zeitgeist from them). It would also realise the web’s potential for knowledge sharing, which should go beyond mere surfing into finding the source material and perhaps surprising links between data sets.

His specification, I think, offers a fertile starting point. It appears to source from and link to existing sources rather than re-invent the wheel, and to use existing technologies and ontologies like MARC and Dublin Core. I think that the specification is also sensible in its identification of users and groups to create and edit lists. It mentions that the service could be run by individual universities, but what would be extremely useful (though perhaps would not happen) is if these instances could link to each other via interfaces to create continually updated communal resources rather than remaining individual silos.

Perhaps this is a slightly off topic thought but I’d love to know which books referred to each other, so that we could examine whether Foo writing Bar read the book by Baz which would be an indicator of influence.

The Bibliographica idea mixes “traditional” scholarship with crowd sourcing and is a sensible and potentially useful idea and service. I think it would need to build a critical mass of data and sources to be really useful but it could encourage use of resources.

UPDATE: Just one of those thoughts I had whilst making some lemon tea. Actually, one of the challenges would be normalising the data from the external sources so that records can be pulled in and kept up to date.

Full text search using PHP and MySQL

Tuesday, December 29th, 2009

I’ve been thinking about full text searching for the letters project and trying to find various solutions that are open source. On the Open Shakespeare and Open Milton sites, we used the Xapian project, which is an excellent search engine. However, I wanted to try and find a way of getting a search running using PHP and MySQL, which is what the site uses at the moment, although I’d be happy to also use Perl. (I also wanted to limit myself to technologies that I use at my current job.)

I started by reading an article on the Zend site that offers an overview of setting up a table with a Full-Text index. As the article mentions, the column being searched must be either VARCHAR or TEXT, as MySQL only builds Full-Text indexes on character columns. If it is not, then just alter the column using

ALTER TABLE <tablename> MODIFY <column> TEXT

(or VARCHAR but TEXT is probably preferable). What the Zend article does not mention is that the table type needs to be MyISAM rather than InnoDB (which means that transactions won’t work on the table). Having made the alteration, I ran the query:

SELECT *, MATCH (<column>) AGAINST ('<search term>') AS score FROM <table> WHERE MATCH (<column>) AGAINST ('<search term>')

The query returns all the columns of each matching row, together with a relevance score against the term.

The SQL code just needs calling as you would any other form of database code. I’m still playing with this but I’ve been ordering the table by the score descending (ORDER BY score DESC) so that the most relevant results are posted for the user.
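As a rough sketch of how the query above might be assembled with the ordering included: the site itself uses PHP, so this Python version is only an illustration, and the table name `letters` and column name `body` in the usage note are hypothetical. The `%s` markers are the placeholder style accepted by several Python MySQL drivers; the search term is passed separately so that the driver can escape it.

```python
import re

# Identifiers cannot be passed as bound parameters, so they are checked
# against a conservative pattern before being placed in the SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_fulltext_query(table, column, limit=20):
    """Return a MySQL full-text query string with two %s placeholders.

    The search term is supplied twice by the caller: once for the score
    column and once for the WHERE clause.
    """
    for name in (table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError("unsafe identifier: %r" % name)
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT *, MATCH ({col}) AGAINST (%s) AS score "
        "FROM {tbl} "
        "WHERE MATCH ({col}) AGAINST (%s) "
        "ORDER BY score DESC "
        "LIMIT {lim}"
    ).format(col=column, tbl=table, lim=limit)


def query_parameters(term):
    """Return the parameter tuple matching the two placeholders."""
    return (term, term)
```

With a database cursor from whichever driver is in use, this would be called along the lines of `cursor.execute(build_fulltext_query("letters", "body"), query_parameters("Chapman"))`. The `LIMIT` clause is an addition of this sketch rather than part of the query I ran.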

I do think that I need to do some pre-processing on my own result set to highlight relevance and to extract further semantic meaning from the results. For example, a search for the publisher ‘Chapman and Hall’ on the Dickens letters (http://austgate.co.uk/dickens/search.php?term=Chapman&submit=Submit+Query) could equally pull up other businesses or people. I still need to write a parser that can make some sort of judgement, even if it is a guess.

I’m sure as I carry on developing the engine and bringing everything together for the project, I’ll have further thoughts on the creation of an engine and creating a more advanced version. This does at least give me a start using current tools (though it is perhaps not as good as Xapian but sometimes you have to at least learn some of the basics).

Update on the Letters of Dickens

Sunday, November 22nd, 2009

Just started on a new version of the Dickens letters which I’m trying to improve before adding in further volumes of text and other authors.

I’ve refactored some of the code to remove some of the cruft and obsolescence. I’ve also been working on the RDF so that I can build up the RDFa links for each letter.

This will be linked to the full text search of the letter text that I’m going to explore using MySQL (which appears to be Xapian-like in some parts). It is only going to be a first stop, as I think that further processing might well be needed to make the links more explicit and the search more relevant. I might well look into increasing the search possibilities for finding letters.

In the future, I’m going to look into annotation bits and pieces and software.

Letters of Charles Dickens website

Friday, September 18th, 2009

I’ve finally posted the first draft of the Dickens website here: http://austgate.co.uk/dickens/index.php?author=Dickens. The idea is that it will allow users to derive networks across a variety of Victorian authors as and when I can develop the datasets.

I’ve also been developing a small text ontology to add to the Friend of a Friend (FOAF) and Dublin Core (DC) ontologies. I’ll post details later. The database schema is still under development but I hope to get that change done soon so that I can get on with the XML changes.

Mining the Letters of Charles Dickens

Tuesday, July 14th, 2009

As an aside, I’ve started a small project to begin visualising ways of searching the letters of Charles Dickens and exploring the Simile library which MIT have produced.

It was originally an extension to the DSpace repository tool, but Rufus Pollock used it in the Open Knowledge Foundation’s Weaving History project, to which I contributed the Milton JSON data file. Originally I’d used it just for biographical timelines but, thinking about it, I wondered how you could use it to mine datasets like the letters of Charles Dickens.

Dickens was a prolific letter writer (the Pilgrim edition extends to 12 thick volumes). I don’t have access to that data but I did download the first volume (of three) that his daughters edited.

Using Perl, I have extracted the date and recipient tags and converted the text file into JSON (as part of a larger process of converting the file into XML and using XSL to transform the data) and then created a table view of the data so that you can easily find the dates of the letters sent to certain people in tabular form.
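The extraction step can be sketched roughly as follows. My script is in Perl; this is a Python illustration only, and the `<recipient>` and `<date>` tag names, along with the placeholder recipients and dates in the sample, are invented for the example rather than taken from the real data file.

```python
import json
import re

# Hypothetical markup: the tag names below are assumptions made for
# this sketch. Each tag is expected to sit on a single line.
LETTER = re.compile(
    r"<recipient>(?P<recipient>.*?)</recipient>\s*<date>(?P<date>.*?)</date>"
)


def letters_to_records(text):
    """Return one dict per tagged letter heading found in the text."""
    return [
        {
            "recipient": match.group("recipient").strip(),
            "date": match.group("date").strip(),
        }
        for match in LETTER.finditer(text)
    ]


def records_by_recipient(records):
    """Group letter dates under each recipient, ready for a table view."""
    table = {}
    for record in records:
        table.setdefault(record["recipient"], []).append(record["date"])
    return table


# Invented placeholder data, not real letters.
SAMPLE = """
<recipient>Recipient A</recipient>
<date>1840-01-01</date>
Text of the first letter...
<recipient>Recipient B</recipient>
<date>1840-02-01</date>
Text of the second letter...
<recipient>Recipient A</recipient>
<date>1840-03-01</date>
Text of the third letter...
"""

records = letters_to_records(SAMPLE)
print(json.dumps(records, indent=2))
print(json.dumps(records_by_recipient(records), indent=2))
```

The first JSON dump is the flat list of letters; the second groups the dates by recipient, which is the shape a tabular view needs.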

I’ve also used the same data set to produce a fairly basic timeline of the letters which is being rewritten from here. It needs some rewriting to update to the new version of timeline.

Twittering RSS

Monday, July 13th, 2009

The slowness of RSS feeds, or their lack of real-time delivery, has reared its head again in terms of getting news out quickly and in “real-time”. Erick Schonfeld on TechCrunch wants to speed them up and John Biggs has decided that RSS needs to RIP.

I’ve been working on Twittering RSS feeds for the JISCMail service, getting the service news feeds to become tweets using Perl with XML::FeedPP and LWP::UserAgent. I’ve even got a script reading Twitter and posting back any posts from the account to an email address so that the helpline doesn’t need to constantly log in to keep itself updated.
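The feed-to-tweet step might look something like this. The real scripts are in Perl; this is a Python sketch using only the standard library, the feed content and example.org links are made up, and the actual posting to Twitter is deliberately left out.

```python
import xml.etree.ElementTree as ET

TWEET_LIMIT = 140  # Twitter's message length limit at the time of writing


def item_to_tweet(title, link, limit=TWEET_LIMIT):
    """Join title and link, trimming the title so the link always fits."""
    room = limit - len(link) - 1  # one space between title and link
    if room < 4:
        # No sensible space left for a title: send the link on its own.
        return link[:limit]
    if len(title) > room:
        title = title[: room - 3].rstrip() + "..."
    return "%s %s" % (title, link)


def feed_to_tweets(rss_xml):
    """Turn each RSS <item> that has a title and link into tweet text."""
    root = ET.fromstring(rss_xml)
    tweets = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            tweets.append(item_to_tweet(title, link))
    return tweets


# Made-up feed for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Service news</title>
    <item>
      <title>Scheduled maintenance this weekend</title>
      <link>http://example.org/news/1</link>
    </item>
    <item>
      <title>New list archive search available</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>"""

for tweet in feed_to_tweets(SAMPLE_FEED):
    print(tweet)
```

Trimming the title rather than the link keeps the tweet useful even when a headline runs long, since the link is what takes the reader back to the full news item.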

Clearly RSS on its own is not going to satisfy the constant stream of news that some users require. I suspect it does for most people, who are not running in real time, but messaging systems on the web are changing and getting faster, which perhaps demands a rethink of how silos, like Twitter and Facebook, and protocols, like RSS, work together.

I noticed that the PubSubHubbub solution that Erick points to builds on Atom and pushes via an IM-style solution. Andy Skelton at WordPress has developed a Jabber plug-in (which I suppose goes some way to alleviating the problem, but only for WordPress).

Pushing content and transforming it into a different protocol is currently the easiest way to make sure that news or events are ported into different services and that the community can be developed. Building and updating communities has never been easier, yet it is frustrating at the same time to work out how the different services talk to each other and how to build “real-time” updates when necessary.

The changing community of publishing

Wednesday, May 13th, 2009

The New York Times had a piece on digital piracy of books and the contrasting views which was picked up by Slashdot. Starting out from the anti-piracy view, it does note that bestsellers are often the most pirated books which backs up Cory Doctorow’s assertion:

“I really feel like my problem isn’t piracy,…It’s obscurity.”

His own position of publishing free digital copies at the same time as the paid-for “treeware” version comes out has helped the all-important word of mouth get about his books. He has built a passionate community around his work who both download and pay for books. By acknowledging that there will be cheapskates who will only download the free version, but encouraging the rest of the community to be involved in discussing and remixing his work, he saw his latest novel stay in the NY Times bestseller list for seven weeks.

There must, however, be an acknowledgement that the creator has rights to the work. Doctorow uses Creative Commons to protect his original work while allowing users certain rights to do something with it. The Open Definition also does this. A simple transformation of rights from closed shops to open ones, i.e. saying what you can do rather than what you cannot, could change publishing and how it reacts to piracy.

So perhaps publishers need to accept that there will always be a certain amount of it going on. However, they should not see piracy as open (it’s not and never will be). The challenge, I believe, for publishers is how to digitise works and make them available to a community, allow the community to do things with the books, and find new markets and models that way.

The transition would be rough and mistakes would be made, but it needs to happen. Publishing needs to learn the lessons of iTunes rather than seeing the digital world as Napster.

It would be great to link into publisher versions of books to create citations or from which to construct models in blogs and wikis using community licenses. It would allow for publisher works to be re-used, ideally be open but perhaps operate on micro-payments based on traffic or level of citation, and for the user to have some authentication (or not depending on publisher) of the data as coming from a reliable source.

Just a thought but the time is ripe for change and experimentation.

XML in Milton and Shakespeare

Wednesday, April 22nd, 2009

As part of the Open Milton project, I’ve been thinking about the place of XML in it. Over Christmas, I wrote a small XSL transform using the Bosak XML Shakespeare files. Rufus took Antony and Cleopatra and, using LaTeX (I gather), created the Open Shakespeare Antony and Cleopatra PDF.

At one level, this is yet another version of Shakespeare. True.

But think of the possibilities. A user could happily generate their own version of the play (for instance using it in a class) or create their own annotated version for that class and not have to worry too much about losing the text / book as it can be printed and shared widely. Communities of interested parties could be pointed towards a website where they could download the material either in final form or just get the XML to use it.

To some extent this is also about embracing a standard and making it common outside of academia and closed repositories. It would appear to be easier to share texts and make use of them if we know what the coding is going to be rather than have to wait for the download to complete before taking a look.

To that end, I’ve started a contribution (currently in prototype) to create a small parser so that we can start transforming text files into TEI (Text Encoding Initiative) Lite format. Granted it is at an early stage but the initial results show some promise and are encouraging (well for me at least).

As per Open Milton/Shakespeare, I’ve been using Python to do this, with the minidom package and regular expressions. The next step will be to split the script into reader, parser and writer. I’ve been concentrating on drama, but prose and verse have their own vocabularies, so the parser will probably need to be split into three, each part concentrating on one form and calling methods from the writer as appropriate.
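To give a flavour of the approach, here is a minimal sketch of a drama parser using minidom and regular expressions. It is not the prototype itself: the input conventions (a speaker name in capitals followed by a full stop, stage directions in square brackets) are assumptions for the example, the sample text is a placeholder, and the output is only a `<body>` fragment using the TEI `sp`, `speaker`, `l` and `stage` elements rather than a complete, valid TEI Lite document.

```python
import re
from xml.dom import minidom

# Assumed input conventions for this sketch:
#   SPEAKER NAME. first line of the speech
#   [A stage direction]
SPEECH = re.compile(r"^(?P<speaker>[A-Z][A-Z ]+)\.\s+(?P<line>.+)$")
STAGE = re.compile(r"^\[(?P<direction>.+)\]$")


def _add_line(doc, speech_node, text):
    """Writer helper: append one <l> (line) element to a speech."""
    line_node = doc.createElement("l")
    line_node.appendChild(doc.createTextNode(text))
    speech_node.appendChild(line_node)


def drama_to_tei(lines):
    """Parse plain-text drama lines into a minidom document fragment."""
    doc = minidom.Document()
    body = doc.createElement("body")
    doc.appendChild(body)
    current_speech = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        stage = STAGE.match(line)
        speech = SPEECH.match(line)
        if stage:
            node = doc.createElement("stage")
            node.appendChild(doc.createTextNode(stage.group("direction")))
            body.appendChild(node)
            current_speech = None
        elif speech:
            current_speech = doc.createElement("sp")
            speaker = doc.createElement("speaker")
            speaker.appendChild(doc.createTextNode(speech.group("speaker")))
            current_speech.appendChild(speaker)
            body.appendChild(current_speech)
            _add_line(doc, current_speech, speech.group("line"))
        elif current_speech is not None:
            # A continuation line belongs to the speech in progress.
            _add_line(doc, current_speech, line)
    return doc


# Placeholder text, not a quotation from any play.
SAMPLE = """[Enter two speakers]
FIRST SPEAKER. An opening line of verse,
continued on a second line.
SECOND SPEAKER. A reply.
"""

print(drama_to_tei(SAMPLE.splitlines()).toprettyxml(indent="  "))
```

Keeping `_add_line` separate from the matching logic hints at the reader/parser/writer split: the regular expressions decide what a line is, and the writer helper decides how it is encoded.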

Depositing blogs – feeding repositories from blogging applications

Thursday, March 19th, 2009

I’ve recently been working on a plugin for WordPress to make each post RDF-enabled using OAI-ORE and SWORD, which I presented to the Oxon SWIG on Tuesday.

The Berlin Declaration on Open Access states that the work should be free and also that it should be deposited in a repository. This seems to be about papers and articles, but what about the use of blogs, wikis and even perhaps Twitter (which might be a little stretch at the moment, but I could see it being used)? That suggests a layer of data which could and, where practical, should be archived in repositories, as these tools are being used as open laboratory notebooks with links to data.

The plug-in that I’m working on is designed to make blogs readable in RDF for the purposes of repository deposit.

At the moment, I have written a channel which lists all the blog’s posts (using ?repository=site) as well as individual posts in RDF (using ?repository=post&repository_id=postid). I’ve been using the SIOC exporter as the base model but I’m looking at using SKOS to get the categories and tags out of WordPress (and trying to leverage folksonomy through that). Next will be to look at the comments and trackbacks and to use isReferencedBy to export incoming links.
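From a client’s point of view, such as a harvester feeding a repository, the two query forms above could be built like this. This is a small Python sketch; the blog address is a placeholder and the function names are made up for the example, with only the `repository` and `repository_id` parameters coming from the plug-in.

```python
from urllib.parse import urlencode


def _with_query(blog_url, pairs):
    """Append an encoded query string to the blog's base address."""
    return "%s/?%s" % (blog_url.rstrip("/"), urlencode(pairs))


def site_channel_url(blog_url):
    """Address of the channel listing all of the blog's posts."""
    return _with_query(blog_url, [("repository", "site")])


def post_rdf_url(blog_url, post_id):
    """Address of the RDF for a single post, identified by its numeric id."""
    return _with_query(
        blog_url, [("repository", "post"), ("repository_id", int(post_id))]
    )


print(site_channel_url("http://example.org/blog"))
print(post_rdf_url("http://example.org/blog", 42))
```

Converting the post id with `int()` is a choice of this sketch, so that anything other than a numeric id fails early instead of producing a malformed request.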

I’ve put this onto KnowledgeForge as its own project.