Update on the Letters of Dickens

November 22nd, 2009

Just started on a new version of the Dickens letters which I’m trying to improve before adding in further volumes of text and other authors.

I’ve refactored some of the code to remove some of the cruft and obsolescence. I’ve also been working on the RDF so that I can build up the RDFa links for each letter.

This will be linked to the full text search of the letter text that I’m going to explore using MySQL (which appears to be Xapian-like in some parts). It is only going to be a first stop as I think that further processing might well be needed to make the links more explicit and the search more relevant. I might well look into increasing the search possibilities for finding letters.

In the future, I’m going to look into annotation bits and pieces and software.

Kirby’s heirs seeking copyright extension for Marvel characters

September 21st, 2009

Just caught this story on the Guardian culture page about the heirs of Jack Kirby seeking to extend the copyright on the Marvel characters that he co-created with Stan Lee. From what I understand, comic copyrights appear to be fairly complicated (certainly more so than book publishing) and perhaps it is an issue that needs to be opened up and simplified.

Given the recent heckling over Google Books and publishing, it seems that this rolling issue will carry on into a new area.

Letters of Charles Dickens website

September 18th, 2009

I’ve finally posted the first draft of the Dickens website here: http://austgate.co.uk/dickens/index.php?author=Dickens. The idea is that it will allow users to derive networks across a variety of Victorian authors as and when I can develop the datasets.

I’ve also been developing a small text ontology to add to the Friend of a Friend (FOAF) and Dublin Core (DC) ontologies. I’ll post details later. The database schema is still under development but I hope to get that change done soon so that I can get on with the XML changes.

Mining the Letters of Charles Dickens

July 14th, 2009

As an aside, I’ve started a small project to begin visualising ways of searching the letters of Charles Dickens and exploring the Simile library which MIT have produced.

It was originally an extension to the D-Space repository tool, but Rufus Pollock used it in the Open Knowledge Foundation’s Weaving History project, to which I contributed the Milton JSON data file. Originally I’d used it just for biographical timelines but, thinking about it, I wondered how you could use it to mine datasets like the letters of Charles Dickens.

Dickens was a prolific letter writer (the Pilgrim edition extends to 12 thick volumes). I don’t have access to that data but I did download the first volume (of three) that his daughters edited.

Using Perl, I have extracted the date and recipient tags and converted the text file into JSON (as part of a larger process of converting the file into XML and using XSL to transform the data) and then created a table view of the data so that you can easily find the dates of the letters sent to certain people in tabular form.
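As a rough illustration of that extraction step, here is a minimal sketch in Python rather than the Perl I actually used. The tag names (`<recipient>` and `<date>`), the function names and the sample text are all invented for the example; the real file's markup and the real script are not shown here.

```python
import json
import re

# Hypothetical markup: these tag names are assumptions for illustration only.
RECIPIENT_RE = re.compile(r"<recipient>(.*?)</recipient>")
DATE_RE = re.compile(r"<date>(.*?)</date>")


def extract_letters(text):
    """Pair each recipient tag with the first date tag that follows it."""
    letters = []
    current = None
    for line in text.splitlines():
        recipient = RECIPIENT_RE.search(line)
        if recipient:
            current = {"recipient": recipient.group(1).strip(), "date": None}
            letters.append(current)
            continue
        date = DATE_RE.search(line)
        if date and current is not None and current["date"] is None:
            current["date"] = date.group(1).strip()
    return letters


def dates_by_recipient(letters):
    """Group the letter dates under each recipient, as a table view might."""
    table = {}
    for letter in letters:
        table.setdefault(letter["recipient"], []).append(letter["date"])
    return table


# Invented placeholder text, not taken from the actual letters.
SAMPLE = """\
<recipient>Mr. A</recipient>
<date>1 January 1840</date>
Body of the first letter...
<recipient>Miss B</recipient>
<date>3 March 1841</date>
Body of the second letter...
<recipient>Mr. A</recipient>
<date>5 May 1841</date>
Body of the third letter...
"""

letters = extract_letters(SAMPLE)
print(json.dumps(letters, indent=2))
print(dates_by_recipient(letters))
```

The JSON list keeps one record per letter, and the grouped dictionary is the sort of thing a table view could be built on.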

I’ve also used the same data set to produce a fairly basic timeline of the letters which is being rewritten from here. It needs some rewriting to update to the new version of timeline.

Twittering RSS

July 13th, 2009

The slowness or lack of real time on RSS feeds has reared its head again in terms of getting news out quickly and in “real-time”. Erick Schonfeld on TechCrunch wants to speed them up and John Biggs has decided that RSS needs to RIP.

I’ve been working on Twittering RSS feeds for the JISCMail service and getting the service news feeds to become tweets, using Perl with XML::FeedPP and LWP::UserAgent. I’ve even got a script reading Twitter and posting back any posts from the account to an email address so that the helpline doesn’t need to constantly log in to update itself.
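The feed-to-tweet step can be sketched roughly as follows. This is Python with the standard library rather than the Perl modules named above, it only builds the message strings and does not post anything, and the function name, the sample feed and the example.org URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

TWEET_LIMIT = 140  # Twitter's character limit at the time of writing


def feed_to_tweets(rss_xml, limit=TWEET_LIMIT):
    """Turn each RSS <item> into a 'title link' string within the limit."""
    root = ET.fromstring(rss_xml)
    tweets = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        # Keep the link whole and shorten the title to make room for it.
        # (Links longer than the limit are not handled in this sketch.)
        room = limit - len(link) - 1 if link else limit
        if len(title) > room:
            title = title[: max(room - 3, 0)].rstrip() + "..."
        tweets.append((title + " " + link).strip())
    return tweets


# Invented sample feed: one short headline and one overlong headline.
SAMPLE_FEED = (
    '<rss version="2.0"><channel>'
    "<title>Example service news</title>"
    "<item><title>Scheduled maintenance this weekend</title>"
    "<link>http://example.org/news/1</link></item>"
    "<item><title>" + "A very long headline " * 10 + "</title>"
    "<link>http://example.org/news/2</link></item>"
    "</channel></rss>"
)

tweets = feed_to_tweets(SAMPLE_FEED)
for tweet in tweets:
    print(len(tweet), tweet)
```

Shortening the title rather than the link is a design choice: a truncated headline still reads, a truncated URL does not work.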

Clearly RSS on its own is not going to help with the constant stream of news attention required by some users. It does, I suspect, for most people who are not running in real time, but messaging systems on the web are changing and getting faster, which perhaps demands a rethink of how silos, like Twitter and Facebook, and protocols, like RSS, work together.

I noticed that the PubSubHubbub solution that Erick points to builds on Atom and pushes via an IM-style solution. Andy Skelton at WordPress has developed a Jabber plug-in (which I suppose goes some way to alleviating the problem but only for WordPress).

Pushing content and transforming it into a different protocol is currently the easiest way to make sure that news or events are ported into different services and that the community can be developed. Building and updating communities has never been easier, or more frustrating at the same time, when trying to see how the different services talk to each other and how to build “real-time” updates when necessary.

Rethinking the idea of the “text”

May 22nd, 2009

Is a text really stable? Is it an entity? In a lecture during my final year at the University of Leicester, one of the English lecturers posed a question: What is a text? After soliciting various answers from the masses, he argued that a text is anything: email, note, manuscript and so on. So let us assume this generalised view is so. However, is the entity stable? Is it whole?

Think about the letter or email (there are enough similarities). It is an entity of text which has time/date, to, from, (sometimes) a subject, and a body. We take it as one thing.

However, think about it. The date/time is an object. To and from are (generally) separate objects with (generally) separate identifiers. The body itself can contain other identifiers such as a nickname, reference to an act/event/letter/party and so on. Each thing can be seen as a single object. So is a text really a collection of identifiers put together in a format?
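To make that view concrete, here is a small sketch of a letter modelled as a bundle of identifiers. The class names, fields and sample values are all hypothetical; it is one possible way of laying the idea out, not a finished model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Identifier:
    kind: str   # e.g. "date", "person", "event"
    value: str


@dataclass
class Letter:
    date: Identifier
    sender: Identifier
    recipient: Identifier
    body: str
    # Identifiers found inside the body: nicknames, events, other letters...
    mentions: List[Identifier] = field(default_factory=list)

    def identifiers(self):
        """Every identifier the letter is built from, header fields first."""
        return [self.date, self.sender, self.recipient] + list(self.mentions)


# Invented example values.
letter = Letter(
    date=Identifier("date", "1840-01-01"),
    sender=Identifier("person", "sender@example.org"),
    recipient=Identifier("person", "recipient@example.org"),
    body="Thank you for the party last week.",
    mentions=[Identifier("event", "party")],
)

for identifier in letter.identifiers():
    print(identifier.kind, identifier.value)
```

Seen this way, the "text" is the body plus a list of things that could each be linked to or looked up on their own.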

If that view is taken, then how do we see a text? How can we approach it and what does it allow us to do with it? Instead of publishing as a platform, would it allow for the text itself as a platform or service?

Cory Doctorow on Creative Commons licensing

May 14th, 2009

Cory Doctorow has come up with a quick guide to self-serve licensing via Creative Commons which outlines the uses and advantages of the licence. The crux, apart from citation of sources, is what it allows users to do to use your data/craft/book/doohickey in innovative ways. From that both parties can learn from each other and possibly evolve new techniques or models. It is certainly a useful guide.

The changing community of publishing

May 13th, 2009

The New York Times had a piece on digital piracy of books and the contrasting views which was picked up by Slashdot. Starting out from the anti-piracy view, it does note that bestsellers are often the most pirated books which backs up Cory Doctorow’s assertion:

“I really feel like my problem isn’t piracy,…It’s obscurity.”

His own position of publishing free digital copies at the same time as the paid-for “treeware” version comes out has helped the all-important word of mouth get about his books. He has built a passionate community around his work who both download and pay for books. By acknowledging that there will be cheapskates who will only download the free version, while encouraging the rest of the community to be involved in discussing and remixing his work, he kept his latest novel on the NY Times bestseller list for seven weeks.

There must, however, be an acknowledgement that the creator has rights to the work. Doctorow uses Creative Commons to protect his original work but to allow users certain rights to do something with the work. The Open Definitions also do this. A simple transformation of rights into open shops rather than closed ones, i.e. saying what you can do rather than what you cannot, could change publishing and how it reacts to piracy.

So perhaps publishers need to accept that there will always be a certain amount of it going on. However, they should not see piracy as open (it’s not and never will be). The challenge, I believe, for publishers is how to digitise and make available works to a community and allow the community to do things with the books and find new markets and models that way.

The transition would be rough and mistakes would be made, but they need to happen. Publishing needs to learn the lessons of iTunes rather than seeing the digital world as Napster.

It would be great to link into publisher versions of books to create citations or from which to construct models in blogs and wikis using community licenses. It would allow for publisher works to be re-used, ideally be open but perhaps operate on micro-payments based on traffic or level of citation, and for the user to have some authentication (or not depending on publisher) of the data as coming from a reliable source.

Just a thought but the time is ripe for change and experimentation.

XML in Milton and Shakespeare

April 22nd, 2009

As part of the Open Milton project, I’ve been thinking about the place of XML in it. Over Christmas, I wrote a small XSL transform using the Bosak XML Shakespeare files. Rufus took Antony and Cleopatra and, using LaTeX (I gather), created the Open Shakespeare Antony and Cleopatra PDF.

At one level, this is yet another version of Shakespeare. True.

But think of the possibilities. A user could happily generate their own version of the play (for instance using it in a class) or create their own annotated version for that class and not have to worry too much about losing the text / book as it can be printed and shared widely. Communities of interested parties could be pointed towards a website where they could download the material either in final form or just get the XML to use it.

To some extent this is also about embracing a standard and making it common outside of academia and closed repositories. It would appear to be easier to share texts and make use of them if we know what the coding is going to be rather than have to wait for the download to complete before taking a look.

To that end, I’ve started a contribution (currently in prototype) to create a small parser so that we can start transforming text files into TEI (Text Encoding Initiative) Lite format. Granted it is at an early stage but the initial results show some promise and are encouraging (well for me at least).

As per Open Milton/Shakespeare, I’ve been using Python to do this with the minidom package and regular expressions. The next step will be to split out the script into reader, parser and writer. I’ve been concentrating on drama but prose and verse have their own vocabularies so the parser will probably need to be split into three, each bit concentrating on a form and calling methods from the writer as appropriate.
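A minimal sketch of how a parser/writer split for drama might look with minidom and regular expressions is below. It is not the prototype itself: the plain-text conventions ("SPEAKER: line" and "[stage direction]"), the function names and the sample lines are assumptions, and the output is only a TEI-style fragment using the `sp`, `speaker`, `l` and `stage` elements, not a validated TEI Lite document.

```python
import re
from xml.dom.minidom import Document

# Assumed plain-text conventions, for illustration only.
SPEECH_RE = re.compile(r"^([A-Z][A-Z ]+):\s*(.+)$")
STAGE_RE = re.compile(r"^\[(.+)\]$")


def parse_drama(lines):
    """Parser: classify each non-blank line as stage, speech or plain line."""
    events = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        stage = STAGE_RE.match(line)
        speech = SPEECH_RE.match(line)
        if stage:
            events.append(("stage", stage.group(1)))
        elif speech:
            events.append(("speech", speech.group(1).strip(), speech.group(2)))
        else:
            events.append(("line", line))
    return events


def add_text_element(doc, parent, tag, text):
    """Writer helper: append <tag>text</tag> to parent and return it."""
    node = doc.createElement(tag)
    node.appendChild(doc.createTextNode(text))
    parent.appendChild(node)
    return node


def write_tei(events):
    """Writer: build a <body> of <stage> and <sp> elements from the events."""
    doc = Document()
    body = doc.createElement("body")
    doc.appendChild(body)
    current_sp = None
    for event in events:
        if event[0] == "stage":
            add_text_element(doc, body, "stage", event[1])
            current_sp = None
        elif event[0] == "speech":
            current_sp = doc.createElement("sp")
            body.appendChild(current_sp)
            add_text_element(doc, current_sp, "speaker", event[1])
            add_text_element(doc, current_sp, "l", event[2])
        else:
            # A plain line continues the current speech, if there is one.
            parent = current_sp if current_sp is not None else body
            add_text_element(doc, parent, "l", event[1])
    return doc


# Invented sample text, not taken from any play.
SAMPLE = """\
[Enter two speakers]
FIRST SPEAKER: Good morrow to you.
And well met.
SECOND SPEAKER: Well met indeed.
"""

doc = write_tei(parse_drama(SAMPLE.splitlines()))
print(doc.toprettyxml(indent="  "))
```

Keeping the parser's output as a simple list of events means a prose or verse parser could reuse the same writer helper with its own element names.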