Archive for the ‘Open Knowledge’ Category

Digitising books and mumblings on open literature

Sunday, February 14th, 2010

Robert McCrum, an associate editor of the Observer, has this remarkably sane blog post regarding the nature of digitisation and Google Books. Perhaps it is only my interpretation but it does seem to be a slight volte-face on his part, as I’ve always interpreted his stance as slightly anti-digitised books.

Having read Adrian Johns’ book Piracy, he considers that

From print culture’s beginnings to the rise of the internet, there has been a succession of intellectual property wars for which the English language has just one word: piracy.

There is a temptation to see digitisation and free culture as piracy. It is certainly easier in the short term to do so but that ignores the wider issues of remix and re-use. Print culture was no stranger to piracy. McCrum uses the second book of Don Quixote where the errant knight escapes from the unauthorised translations and re-uses in a print culture which was far more limited than it is now (although one which was broadening through the use of printing).

Google Books has just stepped in and is doing what publishers and libraries should have been doing: making books available for a new medium. Granted this effort will be expensive (I would have thought) but in the long run it makes sharing and reusing knowledge easier and more useful. This aids intellectual thought and allows ideas and research to be more rigorous.

I’m not sure I share McCrum’s equation of free culture with piracy. He writes that

sometimes piracy can be an engine of social and intellectual innovation as much as it has been an enemy of authors’ rights.

We need to come to a better cultural and social understanding of piracy and free culture. I do believe in sharing and reuse but not piracy in terms of stealing. Google Books is in the vanguard of this understanding but believers in open literature should be working to extend this and ensuring that we do not swap one hegemony for another.

In his piece on the Guardian’s Comment is Free David Drummond, Google’s senior vice president for corporate development and chief legal officer, wrote:

If you love books and care about the knowledge they contain, there is a problem that needs to be solved. Somewhere in the region of 175m books exist in the world today. A tiny fraction of those are in print and for sale in bookshops or on the web. Another small portion are so old that they are out of copyright and anyone can use them.

This is the core problem and Google are indeed to be praised for doing this and, to a certain extent, publishers and libraries damned. But it is a question of expense and Google have the funds and technical expertise whilst libraries may not.

They are indeed trailblazing and it is easy to write jeremiads against this (I’m aware that this blog post might come across as one). I like the fact that books are being made available and appreciate that there is a cultural clash and there are legal issues which need to be dealt with, such as orphan works and monetisation for the concerned parties. All parties need to address these issues. I doubt that there will be a mechanism for orphan works which satisfies everyone, and it will need to be adapted for each legal jurisdiction, but motion towards one is better than no motion at all.

Drummond writes:

The truth is that readers around the world who seek the information locked in millions of out-of-print books currently have little choice other than to travel to a small number of libraries in the hope of finding what they are looking for. And if you’re an author, you have no way to make money from your work if it’s out of print.

That paragraph brings up at least two points. There are a small number of academic research libraries like the Bodleian or Cambridge University Library, and independent researchers not attached to academic institutions can be locked out of them. There is the marvellous inter-library loans service that local libraries can offer (and this has meant that I’ve had access to several books that otherwise I could not get hold of). When I went to university, one of the first things I was advised to do was to register with the local library. At the time I didn’t use it a huge amount but now… They are lifelines and also help the use case argument for libraries, i.e. you use them, you keep them.

How do we get ourselves out of the quagmire that we begin to see ourselves in? At the moment, I don’t know but I’m trying to stumble towards some sort of answer.

Firstly we need to break out of the idea that free/remix/reuse culture = piracy. It does not, Mr McCrum. Piracy is perhaps one subset or occasional union but it does represent one view. Of course piracy exists and it always will.

Secondly we need a wider debate as users and providers as to what we want.

Thirdly I think we need to experiment with models. Google provides one way of doing it but not, I think, the only one.

Varsity article on Open Shakespeare

Friday, February 12th, 2010

I’ve just come across this Varsity article on the Open Shakespeare project which the Open Knowledge Foundation run (and which I did a bit of porting of for Open Milton).

I got involved in other things like the Dickens project and sidetracked that way but the original project has received a second wind.

Bibliographica – open bibliographic sourcing and maintenance

Sunday, January 24th, 2010

Jonathan Gray of the Open Knowledge Foundation has a thought-provoking post on the need for an Open Bibliographic Service which he calls Bibliographica. As he writes:

lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing. Books and articles are published and translated all the time. Works fall in and out of fashion. ‘Secondary’ reference works can become obsolete – considered interesting more for what they say about a particular intellectual period than what they say about their subject matter.

I’ve been working on my own book as an independent researcher and wanted to know common books and articles in the area. As a user I wanted to know what was published in a particular area and what the points of commonality are to identify key works. Jonathan’s idea would be a help for this and, perhaps more importantly, provide a shared platform.

As he identifies, sites like Amazon and LibraryThing allow for the user to create lists of books but over time, fashions change and books fall into and out of favour. Being able to compile searchable, sortable lists would allow the user to develop comprehensive lists (and also allow the intellectual historian to figure out zeitgeists from lists) and also realise the web’s potential for knowledge sharing which should go beyond mere surfing and into finding the source material and perhaps surprising links between data sets.

His specification, I think, offers a fertile starting point. It appears to source from and link to existing sources rather than re-invent the wheel and to also use existing technologies and ontologies like MARC and Dublin Core. I think that the specification is also sensible in its identification of users and groups to create and edit lists. It mentions that the service could be run by individual universities but what would be extremely useful (but perhaps would not happen) is if these services could then link to each other via interfaces to create continually updated communal resources rather than being individual silos.

Perhaps this is a slightly off topic thought but I’d love to know which books referred to each other, so that we could examine whether Foo writing Bar read the book by Baz which would be an indicator of influence.
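As a rough sketch of what that might look like, the references could be held as a simple graph of "this work cites that work" pairs and then walked to see whether one work leads back to another, directly or through intermediate works. Everything here is hypothetical: the titles, the CITATIONS list and the function names are invented for illustration, and a citation only hints at influence rather than proving that anyone read anything.

```python
from collections import defaultdict

# Invented example data: (citing work, cited work) pairs.
CITATIONS = [
    ("Bar", "Book by Baz"),
    ("Quux", "Bar"),
    ("Quux", "Another Book"),
]


def build_graph(citations):
    """Map each citing work to the set of works it refers to."""
    graph = defaultdict(set)
    for citing, cited in citations:
        graph[citing].add(cited)
    return graph


def refers_to(graph, source, target):
    """True if source cites target directly or through a chain of works."""
    seen = set()
    stack = [source]
    while stack:
        work = stack.pop()
        # .get avoids creating empty entries for works that cite nothing.
        for cited in graph.get(work, ()):
            if cited == target:
                return True
            if cited not in seen:
                seen.add(cited)
                stack.append(cited)
    return False


graph = build_graph(CITATIONS)
print(refers_to(graph, "Bar", "Book by Baz"))   # direct reference
print(refers_to(graph, "Quux", "Book by Baz"))  # indirect, via "Bar"
print(refers_to(graph, "Book by Baz", "Bar"))   # no reference back
```

The hard part in practice would be getting clean reference data out of the books in the first place, not the graph itself.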

The Bibliographica idea mixes “traditional” scholarship with crowd sourcing and is a sensible and potentially useful idea and service. I think it would need to build a critical mass of data and sources to be really useful but it could encourage use of resources.

UPDATE: Just one of those thoughts I had whilst making some lemon tea. Actually one of the challenges would be normalising the data sources to update the sources and pull in from the external sources.

Making the web pragmatic?

Sunday, November 22nd, 2009

ReadWriteWeb has an intriguing guest post by Alisa Leonard-Hansen on the idea of the Pragmatic Web. She takes a sanguine look at the Semantic Web and the fact that it is going to take time to build the machines and networking to fully mine the contextual information that will appear.

She explores the way that social relationships can be mined and re-presented by individuals and companies to find the context for the media companies.

There’s something about the focus on the use of identity data by Facebook and the fact that it is only of use if it is immediate that concerns me. I’m more interested in literary data and how to work with this in ‘pragmatic’ ways and I cannot see a place for my voice as these technologies, and their underlying agendas, appear to be guided by the media companies or at least most vociferously guided by them. Certainly in terms of advertising, making older data ‘pragmatic’ is a loser but in the long term, I think that there is a value to it and to creating linked data sets.

Now that some personal projects have come to a temporary end, or at least a needed hiatus before the next version, I’ve got a little more time to explore this and to do more work on Dickens.

Update on the Letters of Dickens

Sunday, November 22nd, 2009

Just started on a new version of the Dickens letters which I’m trying to improve before adding in further volumes of text and other authors.

I’ve refactored some of the code to remove some of the cruft and obsolescence. I’ve also been working on the RDF so that I can build up the RDFa links for each letter.

This will be linked to the full-text search of the letter text that I’m going to explore using MySQL (which appears to be Xapian-like in some parts). It is only going to be a first stop as I think that further processing might well be needed to make the links more explicit and the search more relevant. I might well look into increasing the search possibilities for finding letters.

In the future, I’m going to look into annotation bits and pieces and software.

Kirby’s heirs seeking copyright extension for Marvel characters

Monday, September 21st, 2009

Just caught this story on the Guardian culture page about the heirs of Jack Kirby seeking to extend the copyright on the Marvel characters that he co-created with Stan Lee. From what I understand, comic copyrights appear to be fairly complicated (certainly more so than book publishing) and perhaps it is an issue that needs to be opened up and simplified.

Given the recent heckling over Google Books and publishing, it seems that this rolling issue will carry on into a new area.

Letters of Charles Dickens website

Friday, September 18th, 2009

I’ve finally posted the first draft of the Dickens website here: http://austgate.co.uk/dickens/index.php?author=Dickens. The idea is that it will allow users to derive networks across a variety of Victorian authors as and when I can develop the datasets.

I’ve also been developing a small text ontology to add to the Friend of a Friend (FOAF) and Dublin Core (DC) ontologies. I’ll post details later. The database schema is still under development but I hope to get that change done soon so that I can get on with the XML changes.

Mining the Letters of Charles Dickens

Tuesday, July 14th, 2009

As an aside I’ve started a small project to begin visualising ways of searching the letters of Charles Dickens and exploring the Simile library which MIT have produced.

It was originally an extension to the D-Space repository tool but Rufus Pollock used it in the Open Knowledge Foundation’s Weaving History project, to which I contributed the Milton JSON data file. Originally I’d used it just for biographical timelines but thinking about it, I wondered how you could use it to mine datasets like the letters of Charles Dickens.

Dickens was a prolific letter writer (the Pilgrim edition extends to 12 thick volumes). I don’t have access to that data but I did download the first volume (of three) that his daughters edited.

Using Perl, I have extracted the date and recipient tags and converted the text file into JSON (as part of a larger process of converting the file into XML and using XSL to transform the data) and then created a table view of the data so that you can easily find the dates of the letters sent to certain people in tabular form.
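A minimal sketch of that extraction step, written here in Python rather than Perl purely for illustration. The layout it assumes (a "To <recipient>." heading followed by a line ending in a date) is a guess and not the actual format of the downloaded file, the sample letters are invented, and the names SAMPLE, RECIPIENT, DATE and extract_letters are hypothetical.

```python
import json
import re

# Invented sample text. The real file may well be laid out differently.
SAMPLE = """\
To Mr. Example.
Some Place, January 3rd, 1841.
My dear Sir, ...

To Miss Sample.
Another Place, September 9th, 1842.
My dear Madam, ...
"""

# Assumed heading line, e.g. "To Mr. Example."
RECIPIENT = re.compile(r"^To (?P<recipient>.+?)\.$")
# Assumed date at the end of a line, e.g. "January 3rd, 1841."
DATE = re.compile(r"(?P<date>[A-Z][a-z]+ \d{1,2}(?:st|nd|rd|th)?, \d{4})\.?$")


def extract_letters(text):
    """Return one {"recipient", "date"} record per letter heading found."""
    letters = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        heading = RECIPIENT.match(line)
        if heading:
            current = {"recipient": heading.group("recipient"), "date": None}
            letters.append(current)
            continue
        # Take the first date-like line after a heading as that letter's date.
        if current is not None and current["date"] is None:
            found = DATE.search(line)
            if found:
                current["date"] = found.group("date")
    return letters


letters = extract_letters(SAMPLE)
print(json.dumps(letters, indent=2))
```

The resulting list of records is the sort of thing that can be fed straight into a table or timeline view.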

I’ve also used the same data set to produce a fairly basic timeline of the letters which is being rewritten from here. It needs some rewriting to update to the new version of timeline.

Rethinking the idea of the “text”

Friday, May 22nd, 2009

Is a text really stable? Is it an entity? In a lecture during my final year at the University of Leicester, one of the English lecturers posed a question: What is a text? After soliciting various answers from the masses, he argued that a text is anything – email, note, manuscript and so on. So let us assume this generalised view is so. However is the entity stable? Is it whole?

Think about the letter or email (there are enough similarities). It is an entity of text which has time/date, to, from, (sometimes) a subject, and a body. We take it as one thing.

However think about it. The date/time is an object. To and from are (generally) separate objects with (generally) separate identifiers. The body itself can contain other identifiers such as a nickname, reference to an act/event/letter/party and so on. Each thing can be seen as a single object. So is a text really a collection of identifiers put together in a format?

If that view is taken, then how do we see a text? How can we approach it and what does it allow us to do with it? Instead of publishing as a platform, would it allow for the text itself as a platform or service?
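One way to picture this is a rough sketch in Python, where a letter is modelled as a bundle of separately identified parts and two letters can then be compared by the identifiers they share. The class names, the "person:" and "event:" identifier scheme and the sample letters are all invented for illustration; this is one possible modelling under those assumptions, not a finished design.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set


@dataclass(frozen=True)
class Party:
    """A sender or recipient, held separately from any one letter."""
    identifier: str  # hypothetical scheme, e.g. "person:writer"
    name: str


@dataclass
class Letter:
    sent: date
    sender: Party
    recipient: Party
    body: str
    subject: Optional[str] = None
    # Identifiers found in the body: nicknames, events, other letters, etc.
    mentions: List[str] = field(default_factory=list)

    def identifiers(self) -> Set[str]:
        """Every identifier this text is put together from."""
        ids = {
            "date:" + self.sent.isoformat(),
            self.sender.identifier,
            self.recipient.identifier,
        }
        ids.update(self.mentions)
        return ids


def shared_identifiers(a: Letter, b: Letter) -> Set[str]:
    """Identifiers that link two texts to each other."""
    return a.identifiers() & b.identifiers()


writer = Party("person:writer", "A. Writer")
friend = Party("person:friend", "A. Friend")
editor = Party("person:editor", "An Editor")

first = Letter(date(1841, 1, 3), writer, friend, "Dinner on Tuesday?",
               mentions=["event:dinner"])
second = Letter(date(1841, 1, 5), writer, editor, "About that dinner...",
                mentions=["event:dinner", "person:friend"])

print(sorted(shared_identifiers(first, second)))
```

Seen this way, the "text" is less a fixed block and more a set of links that other texts can hook into.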

Cory Doctorow on Creative Commons licensing

Thursday, May 14th, 2009

Cory Doctorow has come up with a quick guide to self-serve licensing via Creative Commons which outlines the uses and advantages of the licence. The crux, apart from citation of sources, is what it allows users to do to use your data/craft/book/doohickey in innovative ways. From that both parties can learn from each other and possibly evolve new techniques or models. It is certainly a useful guide.