Bibliographica – open bibliographic sourcing and maintenance

January 24th, 2010

Jonathan Gray of the Open Knowledge Foundation has a thought-provoking post on the need for an Open Bibliographic Service, which he calls Bibliographica. As he writes:

lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing. Books and articles are published and translated all the time. Works fall in and out of fashion. ‘Secondary’ reference works can become obsolete – considered interesting more for what they say about a particular intellectual period than what they say about their subject matter.

I’ve been working on my own book as an independent researcher and wanted to know the common books and articles in the area. As a user I wanted to know what was published in a particular area and where the points of commonality are, so as to identify key works. Jonathan’s idea would help with this and, perhaps more importantly, provide a shared platform.

As he identifies, sites like Amazon and LibraryThing allow the user to create lists of books, but over time fashions change and books fall into and out of favour. Being able to compile searchable, sortable lists would allow the user to develop comprehensive lists (and allow the intellectual historian to infer zeitgeists from them). It would also realise the web’s potential for knowledge sharing, which should go beyond mere surfing and into finding the source material and perhaps surprising links between data sets.

His specification, I think, offers a fertile starting point. It appears to source from and link to existing sources rather than re-invent the wheel, and to use existing technologies and ontologies like MARC and Dublin Core. I think that the specification is also sensible in its identification of users and groups to create and edit lists. It mentions that the service could be run by individual universities, but what would be extremely useful (though perhaps would not happen) is if these installations could link to each other via interfaces to create continually updated communal resources rather than remaining individual silos.

Perhaps this is a slightly off-topic thought, but I’d love to know which books referred to each other, so that we could examine whether Foo, writing Bar, had read the book by Baz, which would be an indicator of influence.

The Bibliographica idea mixes “traditional” scholarship with crowd sourcing and is a sensible and potentially useful idea and service. I think it would need to build a critical mass of data and sources to be really useful but it could encourage use of resources.

UPDATE: Just one of those thoughts I had whilst making some lemon tea. One of the real challenges would be normalising the data from the different external sources, so that records can be pulled in and kept up to date.
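
To make the normalisation problem a little more concrete, here is a minimal sketch in Python. It assumes records arrive as flat dictionaries, either with Dublin Core element names or with MARC tag and subfield codes flattened into keys such as "245a" (title), "100a" (main author) and "260c" (date). The flattened keys, the FIELD_MAPS table and both function names are my own assumptions for illustration, not anything taken from the Bibliographica specification.

```python
import re

# Hypothetical mapping from each source's field names to shared names.
FIELD_MAPS = {
    "dublin_core": {"title": "title", "creator": "creator", "date": "date"},
    "marc": {"245a": "title", "100a": "creator", "260c": "date"},
}


def normalise_record(record, source):
    """Map a source-specific record onto one shared dictionary shape."""
    mapping = FIELD_MAPS[source]
    clean = {}
    for field, shared_name in mapping.items():
        value = record.get(field, "")
        # Collapse runs of whitespace and drop trailing cataloguing
        # punctuation such as " /" or "," that MARC records often carry.
        value = re.sub(r"\s+", " ", value).strip().rstrip(" /:;,.")
        clean[shared_name] = value
    return clean


def merge_key(record):
    """Return a crude key for spotting the same work from two sources."""
    return (record["title"].lower(), record["creator"].lower())
```

Two normalised records with the same merge_key could then be treated as candidates for the same work, although a real service would need something far less naive than exact matching.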

Full text search using PHP and MySQL

December 29th, 2009

I’ve been thinking about full text searching for the letters project and trying to find various solutions that are open source. On the Open Shakespeare and Open Milton sites, we used the Xapian project, which is an excellent search engine. However, I wanted to find a way of getting a search running using PHP and MySQL, which is what the site uses at the moment, although I’d be happy to use Perl as well. (I also wanted to limit myself to technologies that I use in my current job.)

I started by reading an article on the Zend site that offers an overview of setting up a table to run with a Full-Text index. As the article mentions, you have to ensure that the column being searched is CHAR, VARCHAR or TEXT, as MySQL only builds FULLTEXT indexes on those column types. If it is not, then just alter the column using

ALTER TABLE <tablename> MODIFY <column> TEXT

(or VARCHAR but TEXT is probably preferable). What the Zend article does not mention is that the table type needs to be MyISAM rather than InnoDB (which means that transactions won’t work on the table). Having made the alteration and added a FULLTEXT index on the column (ALTER TABLE <tablename> ADD FULLTEXT (<column>)), I ran the query:

SELECT *, MATCH (<column>) AGAINST ('<search term>') AS score FROM <tablename> WHERE MATCH (<column>) AGAINST ('<search term>')

The query returns all the columns of each matching row, along with a relevance score for the search term.

The SQL just needs calling as you would any other database query. I’m still playing with this, but I’ve been ordering the results by descending score (ORDER BY score DESC) so that the most relevant results are shown to the user first.
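
As a rough sketch of how that query might be assembled in application code, the helper below builds the same statement with the search term left as a placeholder. It is written in Python purely for illustration (the site itself uses PHP). The function name build_fulltext_query, the %s placeholder style and the default limit are all assumptions of mine rather than anything from the Zend article.

```python
import re

# Table and column names cannot be passed as query parameters, so they are
# restricted to a conservative pattern before being put into the SQL string.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_fulltext_query(table, column, limit=20):
    """Return a MySQL FULLTEXT search statement with two %s placeholders.

    The search term itself is not interpolated; both placeholders are meant
    to be filled by the database driver with the same term.
    """
    for name in (table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError("unsafe identifier: %r" % name)
    sql = (
        "SELECT *, MATCH ({col}) AGAINST (%s) AS score "
        "FROM {tbl} "
        "WHERE MATCH ({col}) AGAINST (%s) "
        "ORDER BY score DESC "
        "LIMIT {limit:d}"
    ).format(col=column, tbl=table, limit=int(limit))
    return sql
```

With a driver that uses %s placeholders, the statement would be run along the lines of cursor.execute(sql, (term, term)), leaving the escaping of the term to the driver.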

I do think that I need to do some pre-processing on my own result set to highlight relevance and to extract further semantic meaning from the results. For example, a search for the publisher ‘Chapman and Hall’ on the Dickens letters (http://austgate.co.uk/dickens/search.php?term=Chapman&submit=Submit+Query) could equally pull up other businesses or people. I still need to write a parser that can make some sort of judgement, even if it is a guess.
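
The kind of guesswork I have in mind could start as simply as the Python sketch below, which looks at the words around each hit. The cue lists, the window size and the function name guess_entity_type are hypothetical and deliberately crude. This is not the parser itself, only an indication of the approach.

```python
import re

# Hypothetical cue words: a title directly before the term suggests a
# person, while trade terms anywhere nearby suggest a business.
PERSON_CUES = ("mr", "mrs", "miss", "dr", "sir", "lady")
BUSINESS_CUES = ("messrs", "& co", "publishers", "booksellers")


def guess_entity_type(text, term, window=30):
    """Guess what each occurrence of term refers to.

    Returns a list of (snippet, guess) pairs, where guess is "person",
    "business" or "unknown".
    """
    results = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippet = text[start:end]
        before = text[start:match.start()].lower()
        if any(cue in snippet.lower() for cue in BUSINESS_CUES):
            guess = "business"
        elif any(re.search(r"\b%s\b\.?\s*$" % cue, before) for cue in PERSON_CUES):
            guess = "person"
        else:
            guess = "unknown"
        results.append((snippet, guess))
    return results
```

The snippets could double as the highlighted extracts shown in the results list, with the guess used to group or label the hits.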

I’m sure as I carry on developing the engine and bringing everything together for the project, I’ll have further thoughts on the creation of an engine and creating a more advanced version. This does at least give me a start using current tools (though it is perhaps not as good as Xapian but sometimes you have to at least learn some of the basics).

Making the web pragmatic?

November 22nd, 2009

ReadWriteWeb has an intriguing guest post by Alisa Leonard-Hansen on the idea of the Pragmatic Web. She takes a sanguine look at the Semantic Web and the fact that it is going to take time to build the machines and networking to fully mine the contextual information that will appear.

She explores the way that social relationships can be mined and re-presented by individuals and companies to find the context for the media companies.

There’s something about the focus on the use of identity data by Facebook, and the fact that it is only of use if it is immediate, that concerns me. I’m more interested in literary data and how to work with it in ‘pragmatic’ ways, and I cannot see a place for my voice, as these technologies and their underlying agendas appear to be guided by the media companies, or at least most vociferously guided by them. Certainly in terms of advertising, making older data ‘pragmatic’ is a loser, but in the long term I think that there is value in it and in creating linked data sets.

Now that some personal projects have come to a temporary end, or at least a needed hiatus before the next version, I’ve got a little more time to explore this and to do more work on Dickens.

Update on the Letters of Dickens

November 22nd, 2009

Just started on a new version of the Dickens letters which I’m trying to improve before adding in further volumes of text and other authors.

I’ve refactored some of the code to remove some of the cruft and obsolescence. I’ve also been working on the RDF so that I can build up the RDFa links for each letter.

This will be linked to the full text search of the letter text that I’m going to explore using MySQL (which appears to be Xapian-like in some parts). It is only going to be a first stop, as I think that further processing might well be needed to make the links more explicit and the search more relevant. I might well look into increasing the search possibilities for finding letters.

In the future, I’m going to look into annotation bits and pieces and software.

Kirby’s heirs seeking copyright extension for Marvel characters

September 21st, 2009

Just caught this story on the Guardian culture page about the heirs of Jack Kirby seeking to extend the copyright on the Marvel characters that he co-created with Stan Lee. From what I understand, comic copyrights appear to be fairly complicated (certainly more so than book publishing) and perhaps it is an issue that needs to be opened up and simplified.

Given the recent heckling over Google Books and publishing, it seems that this rolling issue will carry on into a new area.

Letters of Charles Dickens website

September 18th, 2009

I’ve finally posted the first draft of the Dickens website here: http://austgate.co.uk/dickens/index.php?author=Dickens. The idea is that it will allow users to derive networks across a variety of Victorian authors as and when I can develop the datasets.

I’ve also been developing a small text ontology to add to the Friend of a Friend (FOAF) and Dublin Core (DC) ontologies. I’ll post details later. The database schema is still under development but I hope to get that change done soon so that I can get on with the XML changes.

Mining the Letters of Charles Dickens

July 14th, 2009

As an aside, I’ve started a small project to begin visualising ways of searching the letters of Charles Dickens and exploring the Simile library which MIT have produced.

It’s originally an extension to the D-Space repository tool, but Rufus Pollock used it in the Open Knowledge Foundation’s Weaving History project, to which I contributed the Milton JSON data file. Originally I’d used it just for biographical timelines but, thinking about it, I wondered how you could use it to mine datasets like the letters of Charles Dickens.

Dickens was a prolific letter writer (the Pilgrim edition extends to 12 thick volumes). I don’t have access to that data but I did download the first volume (of three) that his daughters edited.

Using Perl, I have extracted the date and recipient tags and converted the text file into JSON (as part of a larger process of converting the file into XML and using XSL to transform the data) and then created a table view of the data so that you can easily find the dates of the letters sent to certain people in tabular form.
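
The extraction step can be sketched as follows. This is a Python illustration rather than the Perl I actually used, and it assumes each letter in the text file has already been marked up with one <recipient> tag followed by one <date> tag. The tag names, the function name letters_to_json and the shape of the JSON (an "items" list with "type" and "label" keys, loosely modelled on what the Simile tools read) are assumptions for the example.

```python
import json
import re

# Assumed markup: each letter carries a <recipient> tag followed by a
# <date> tag. Everything else in the file is ignored.
LETTER_PATTERN = re.compile(
    r"<recipient>(?P<recipient>.*?)</recipient>\s*<date>(?P<date>.*?)</date>",
    re.DOTALL,
)


def letters_to_json(text):
    """Extract recipient and date from each letter and return a JSON string."""
    items = []
    for match in LETTER_PATTERN.finditer(text):
        recipient = match.group("recipient").strip()
        date = match.group("date").strip()
        items.append(
            {
                "type": "Letter",
                "label": "%s, %s" % (recipient, date),
                "recipient": recipient,
                "date": date,
            }
        )
    return json.dumps({"items": items}, indent=2)
```

The resulting file can then feed both the table view and the timeline, since each item carries the recipient and the date as separate fields.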

I’ve also used the same data set to produce a fairly basic timeline of the letters which is being rewritten from here. It needs some rewriting to update to the new version of timeline.

Twittering RSS

July 13th, 2009

The slowness or lack of real time on RSS feeds has reared its head again in terms of getting news out quickly and in “real-time”. Erick Schonfeld on Techcrunch wants to speed them up and  John Biggs has decided that RSS needs to RIP.

I’ve been working on Twittering RSS feeds for the JISCMail service, turning the service news feeds into tweets with Perl, XML::FeedPP and LWP::UserAgent. I’ve even got a script reading Twitter and posting back any posts from the account to an email address so that the helpline doesn’t need to constantly log in to keep itself updated.
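
The feed-to-tweet step is simple enough to sketch. The version below is in Python using only the standard library, not the Perl modules named above, and it stops short of posting anything. It only turns each RSS item into a string that fits the 140 character limit. The function name feed_to_tweets and the "title link" layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TWEET_LIMIT = 140  # Twitter's character limit at the time of writing


def feed_to_tweets(rss_xml, limit=TWEET_LIMIT):
    """Turn each RSS <item> into a 'title link' string within the limit."""
    root = ET.fromstring(rss_xml)
    tweets = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        # Reserve room for the link plus one separating space.
        room = limit - len(link) - 1 if link else limit
        if len(title) > room:
            # Shorten the title and mark the cut with three dots.
            keep = max(room - 3, 0)
            title = title[:keep].rstrip() + "..."
        tweets.append(("%s %s" % (title, link)).strip())
    return tweets
```

Each string returned would then be handed to whatever performs the actual post, which is the part that LWP::UserAgent handles in my script.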

Clearly RSS on its own is not going to satisfy users who need a constant stream of news. I suspect it does for most people, who are not running in real time, but messaging systems on the web are changing and getting faster, which perhaps demands a rethink of how silos like Twitter and Facebook and protocols like RSS work together.

I noticed that the PubSubHubbub solution that Erick points to builds on Atom and pushes via an IM-style solution. Andy Skelton at WordPress has developed a Jabber plug-in (which I suppose goes some way to alleviating the problem but only for WordPress).

Pushing content and transforming it into a different protocol is currently the easiest way to make sure that news or events are ported into different services and that the community can be developed. Building and updating communities has never been easier, yet at the same time it is frustrating trying to see how the different services talk to each other and how to build “real-time” updates when necessary.

Rethinking the idea of the “text”

May 22nd, 2009

Is a text really stable? Is it an entity? In a lecture during my final year at the University of Leicester, one of the English lecturers posed a question: What is a text? After soliciting various answers from the masses, he argued that a text is anything: email, note, manuscript and so on. So let us assume this generalised view is so. However, is the entity stable? Is it whole?

Think about the letter or email (there are enough similarities). It is an entity of text which has time/date, to, from, (sometimes) a subject, and a body. We take it as one thing.

However, think about it. The date/time is an object. To and from are (generally) separate objects with (generally) separate identifiers. The body itself can contain other identifiers such as a nickname or a reference to an act/event/letter/party and so on. Each thing can be seen as a single object. So is a text really a collection of identifiers put together in a format?
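
To illustrate the point with an email, the short Python sketch below pulls a message apart into those separate objects using the standard library’s email package. The function name decompose and the choice of keys are mine. Identifying the references inside the body is left out, since that is the hard part.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def decompose(raw_message):
    """Split a raw email into its separately identifiable parts."""
    msg = message_from_string(raw_message)
    return {
        # The date becomes a datetime object rather than a piece of text.
        "date": parsedate_to_datetime(msg["Date"]) if msg["Date"] else None,
        # Senders and recipients become lists of address identifiers.
        "from": [addr for _, addr in getaddresses(msg.get_all("From", []))],
        "to": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "subject": msg["Subject"],
        # The body is kept as text; it may hold further identifiers.
        "body": msg.get_payload(),
    }
```

Seen this way, the "text" is the dictionary as a whole, and each value is something that could be linked to or searched in its own right.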

If that view is taken, then how do we see a text? How can we approach it and what does it allow us to do with it? Instead of publishing as a platform, would it allow for the text itself as a platform or service?