Archive for the ‘Text Mining’ Category

Letters of Charles Dickens website

Friday, September 18th, 2009

I’ve finally posted the first draft of the Dickens website here: http://austgate.co.uk/dickens/index.php?author=Dickens. The idea is that it will allow users to derive networks across a variety of Victorian authors as and when I can develop the datasets.

I’ve also been developing a small text ontology to add to the Friend of a Friend (FOAF) and Dublin Core (DC) ontologies. I’ll post details later. The database schema is still under development, but I hope to finish that change soon so that I can get on with the XML changes.

Mining the Letters of Charles Dickens

Tuesday, July 14th, 2009

As an aside, I’ve started a small project to visualise ways of searching the letters of Charles Dickens and to explore the Simile library which MIT have produced.

It was originally an extension to the DSpace repository tool, but Rufus Pollock used it in the Open Knowledge Foundation’s Weaving History project, to which I contributed the Milton JSON data file. Originally I’d used it just for biographical timelines but, thinking about it, I wondered how you could use it to mine datasets like the letters of Charles Dickens.

Dickens was a prolific letter writer (the Pilgrim edition extends to 12 thick volumes). I don’t have access to that data but I did download the first volume (of three) that his daughters edited.

Using Perl, I have extracted the date and recipient tags and converted the text file into JSON, as part of a larger process of converting the file into XML and using XSL to transform the data. I then created a table view of the data so that you can easily find the dates of the letters sent to particular people.
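For anyone wanting to try something similar, here is a minimal sketch of the extraction step. It is in Python rather than the Perl I actually used, and it is not the real script. The `<recipient>` and `<date>` tag names, the layout of the file and the sample records are all assumptions made up for illustration, so adjust the pattern to whatever your source text really looks like.

```python
import json
import re

# ASSUMPTION: each letter starts with a <recipient> tag followed by a <date>
# tag. These tag names are hypothetical, not the markup of the real file.
LETTER_RE = re.compile(
    r"<recipient>(?P<recipient>.*?)</recipient>\s*<date>(?P<date>.*?)</date>",
    re.DOTALL,
)


def extract_letters(text):
    """Return one {'recipient', 'date'} record per tagged letter heading."""
    return [
        {
            "recipient": match.group("recipient").strip(),
            "date": match.group("date").strip(),
        }
        for match in LETTER_RE.finditer(text)
    ]


def dates_by_recipient(records):
    """Group letter dates under each recipient, as a table view might."""
    table = {}
    for record in records:
        table.setdefault(record["recipient"], []).append(record["date"])
    return table


# Placeholder input: invented recipients and dates, not real letters.
SAMPLE = """
<recipient>Recipient A</recipient>
<date>1837-01-01</date>
Body of the first letter...

<recipient>Recipient B</recipient>
<date>1839-01-01</date>
Body of the second letter...

<recipient>Recipient A</recipient>
<date>1840-01-01</date>
Body of the third letter...
"""

if __name__ == "__main__":
    records = extract_letters(SAMPLE)
    # The JSON list is what a table or timeline front end would load.
    print(json.dumps(records, indent=2))
    print(dates_by_recipient(records))
```

Keeping the records as a flat list of date and recipient pairs means the same JSON can feed both a table view and a timeline without further reshaping.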

I’ve also used the same data set to produce a fairly basic timeline of the letters, which is being rewritten from here. It needs updating to work with the new version of Timeline.

Rethinking the idea of the “text”

Friday, May 22nd, 2009

Is a text really stable? Is it an entity? In a lecture during my final year at the University of Leicester, one of the English lecturers posed a question: What is a text? After soliciting various answers from the masses, he argued that a text is anything: email, note, manuscript and so on. So let us assume this generalised view is so. However, is the entity stable? Is it whole?

Think about the letter or email (there are enough similarities). It is an entity of text which has time/date, to, from, (sometimes) a subject, and a body. We take it as one thing.

However, think about it. The date/time is an object. To and from are (generally) separate objects with (generally) separate identifiers. The body itself can contain other identifiers such as a nickname, or a reference to an act/event/letter/party and so on. Each thing can be seen as a single object. So is a text really a collection of identifiers put together in a format?
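To make the idea a little more concrete, here is a rough sketch in Python of a letter modelled as a bundle of identifiers. The class names, fields and the `shared_identifiers` helper are my own invention for illustration, not part of any existing schema or ontology.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identifier:
    """One addressable thing: a person, a date, an event and so on."""
    kind: str   # hypothetical vocabulary, e.g. "person", "date", "event"
    value: str


@dataclass
class Letter:
    """A text seen as identifiers held together by a format."""
    date: Identifier
    sender: Identifier
    recipient: Identifier
    body: str
    mentions: list = field(default_factory=list)  # identifiers found in the body

    def identifiers(self):
        """Every identifier the letter carries, header and body alike."""
        return [self.date, self.sender, self.recipient, *self.mentions]


def shared_identifiers(first, second):
    """Identifiers that two letters have in common."""
    return set(first.identifiers()) & set(second.identifiers())
```

Once the parts are separate objects, two letters that mention the same person or event can be linked without reading either body, which is roughly what treating the text as a platform or service would need.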

If that view is taken, then how do we see a text? How can we approach it and what does it allow us to do with it? Instead of publishing as a platform, would it allow for the text itself as a platform or service?

Building data stores

Sunday, July 6th, 2008

Mats Dahlstrom’s talk at the Dilemmas of Digitization conference mentioned the paper Deep Sharing: A Case for the Federated Digital Library by David Seaman.

It would be great if there were a system for rapidly building small data stores of texts from scratch, with editing software components and text encoding output in RDF and TEI, so that data can be shared electronically rather than expecting users to re-enter key fields such as bibliographic data.

Last weekend, I quickly hacked up a sample from Milton’s cry for free printing, the Areopagitica, and began to encode some of the text as RDF. I think I’ve over-egged the pudding, as it were, by adding SKOS (I was curious to see if you can adapt it to text documents, but Dublin Core is a better fit). As I am using just a few lines of text, I didn’t use the RDF API for PHP but hacked up a quick template with a database behind it. I’ll be looking to rewrite this at some point soon (as I will with the beginnings of an alternate spelling database, to show that you could use SKOS to highlight any alternative spellings or misspellings in a text).
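As a rough idea of what the alternate spelling side might look like, here is a small Python sketch that writes one SKOS concept per word as Turtle. It is not the PHP template mentioned above. The `http://example.org/spelling/` namespace and the sample word forms are invented for illustration (they are not checked against any edition), while `skos:prefLabel`, `skos:altLabel` and `skos:hiddenLabel` are standard SKOS properties.

```python
from urllib.parse import quote

# ASSUMPTION: a made-up namespace for the spelling concepts.
BASE = "http://example.org/spelling/"
PREFIXES = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(preferred, variants=(), misspellings=(), lang="en"):
    """Describe one word as a skos:Concept.

    The modern form is the preferred label, accepted variants are
    alternative labels and outright misspellings are hidden labels.
    """
    properties = [("skos:prefLabel", preferred)]
    properties += [("skos:altLabel", variant) for variant in variants]
    properties += [("skos:hiddenLabel", wrong) for wrong in misspellings]
    body = " ;\n".join(
        f'    {name} "{escape(value)}"@{lang}' for name, value in properties
    )
    return f"<{BASE}{quote(preferred)}> a skos:Concept ;\n{body} .\n"


if __name__ == "__main__":
    # Illustrative word forms only.
    print(PREFIXES)
    print(concept_to_turtle("book", variants=["booke"], misspellings=["boook"]))
```

Hidden labels suit misspellings because they are meant for matching in search rather than for display, so a search for the wrong form can still find the right concept.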

Spelunking text data

Sunday, July 6th, 2008

One of the ARTFL developers presented PhiloLogic and its PhiloMine extension. Both are free text searching databases and tools. Both sets of code are designed for large sets of data, which does raise the question of whether it might be useful to develop a set of tools for smaller data holdings or individuals.