Archive for the ‘Text Mining’ Category
A change to the Letters project
Sunday, March 28th, 2010
During the previously blogged dinner with Ben and Rufus, we talked about the nascent work on the letters project. Both have “encouraged” me (it didn’t take too much persuasion, it must be said) to move the project to the Open Knowledge Foundation and to port it to Python with a Redis backend rather than the current PHP/MySQL setup. I hope that the move will be complete soon.
Textcamp announced
Sunday, March 28th, 2010
Had dinner with Rufus Pollock and Ben O’Steen on Monday in Oxford. As part of the discussions, the notion of Textcamp was raised and Ben has created the Textcamp website with an associated blog. It is a slightly bigger concept than I had had in mind, but the approach, I think, will allow the creation of a wider community and a place to publicly follow up any ideas that get thrown up. I like the idea of hacking texts as well and it will be great to have a place to discuss ideas and to learn. Equally, Ben’s post makes it clear that it should be friendly and helpful, leading up to a Barcamp-style event. It is slated to run in August or September. I can’t wait.
Mining data driving the web?
Wednesday, March 17th, 2010
Just seen an article on TechCrunch by Bradford Cross of Flightcaster regarding the growth of data on the Web. He appears to argue that data and its uses will drive the Web soon, writing:
the data age is less about the raw size of your data, and more about the cool stuff you can do with it. Now that there is so much data, it is time to unlock its value.
It seems fairly straightforward given the lower barriers to growth and the tools to create and access data.
There are issues with this, such as learning how best to leverage these for the user and to gain the most benefit. It’ll certainly be an interesting time, and Cross identifies a few technologies and ideas which may or may not gain currency but will spark debate nonetheless.
Letters of Charles Dickens website
Friday, September 18th, 2009
I’ve finally posted the first draft of the Dickens website here: http://austgate.co.uk/dickens/index.php?author=Dickens. The idea is that it will allow users to derive networks across a variety of Victorian authors as and when I can develop the datasets.
I’ve also been developing a small text ontology to add to the Friend of a Friend (FOAF) and Dublin Core (DC) ontologies. I’ll post details later. The database schema is still under development but I hope to get that change done soon so that I can get on with the XML changes.
Mining the Letters of Charles Dickens
Tuesday, July 14th, 2009
As an aside, I’ve started a small project to begin visualising ways of searching the letters of Charles Dickens and exploring the Simile library which MIT have produced.
It was originally an extension to the DSpace repository tool, but Rufus Pollock used it in the Open Knowledge Foundation’s Weaving History project, to which I contributed the Milton JSON data file. Originally I’d used it just for biographical timelines but, thinking about it, I wondered how you could use it to mine datasets like the letters of Charles Dickens.
Dickens was a prolific letter writer (the Pilgrim edition extends to 12 thick volumes). I don’t have access to that data but I did download the first volume (of three) that his daughters edited.
Using Perl, I have extracted the date and recipient tags and converted the text file into JSON (as part of a larger process of converting the file into XML and using XSL to transform the data) and then created a table view of the data so that you can easily find the dates of the letters sent to certain people in tabular form.
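The extraction step described above could be sketched roughly as follows. This is a minimal illustration in Python rather than the Perl actually used, and it assumes a hypothetical markup: the `<letter>`, `<date>` and `<recipient>` tags, the placeholder recipients and the dates are all invented for the example and are not taken from the real data file.

```python
import json
import re
from collections import defaultdict

# Hypothetical input: the real tag names and file layout are not shown in the
# post, so this markup and these placeholder values are assumptions.
SAMPLE = """
<letter><date>1837-01-05</date><recipient>Recipient A</recipient>Body text</letter>
<letter><date>1837-02-10</date><recipient>Recipient B</recipient>Body text</letter>
<letter><date>1838-03-02</date><recipient>Recipient A</recipient>Body text</letter>
"""

# Capture the date and recipient that open each letter.
LETTER_RE = re.compile(
    r"<letter>\s*<date>(?P<date>.*?)</date>\s*"
    r"<recipient>(?P<recipient>.*?)</recipient>",
    re.DOTALL,
)


def extract_letters(text):
    """Return one {"date", "recipient"} dict per letter, in document order."""
    return [
        {"date": m.group("date").strip(), "recipient": m.group("recipient").strip()}
        for m in LETTER_RE.finditer(text)
    ]


def dates_by_recipient(letters):
    """Group the letter dates by recipient, as a table view might need."""
    table = defaultdict(list)
    for letter in letters:
        table[letter["recipient"]].append(letter["date"])
    return {name: sorted(dates) for name, dates in table.items()}


letters = extract_letters(SAMPLE)
# The JSON list is the kind of structure a table or timeline view could load.
print(json.dumps(letters, indent=2))
print(json.dumps(dates_by_recipient(letters), indent=2))
```

Grouping by recipient is only one possible view; the same list of records could equally be sorted by date to feed a timeline.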
I’ve also used the same data set to produce a fairly basic timeline of the letters which is being rewritten from here. It needs some rewriting to update to the new version of timeline.
Rethinking the idea of the “text”
Friday, May 22nd, 2009
Is a text really stable? Is it an entity? In a lecture during my final year at the University of Leicester, one of the English lecturers posed a question: What is a text? After soliciting various answers from the masses, he argued that a text is anything: email, note, manuscript and so on. So let us assume this generalised view is so. However, is the entity stable? Is it whole?
Think about the letter or email (there are enough similarities). It is an entity of text which has time/date, to, from, (sometimes) a subject, and a body. We take it as one thing.
However, think about it. The date/time is an object. To and from are (generally) separate objects with (generally) separate identifiers. The body itself can contain other identifiers such as a nickname, or a reference to an act/event/letter/party and so on. Each thing can be seen as a single object. So is a text really a collection of identifiers put together in a format?
If that view is taken, then how do we see a text? How can we approach it and what does it allow us to do with it? Instead of publishing as a platform, would it allow for the text itself as a platform or service?
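One way to picture the letter as a collection of identifiers is a small data structure. The sketch below is only an illustration of the idea: the class names, the `kind` labels and the sample values are all hypothetical and do not come from any existing schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Identifier:
    """A single addressable thing: a person, a date, an event and so on."""
    kind: str
    value: str


@dataclass
class Letter:
    """A letter (or email) treated as a bundle of identifiers plus a body."""
    sent: date
    sender: Identifier
    recipient: Identifier
    body: str
    subject: Optional[str] = None
    # Identifiers found inside the body: nicknames, events, other letters.
    mentions: List[Identifier] = field(default_factory=list)

    def identifiers(self) -> List[Identifier]:
        """List every identifier the letter carries, header fields first."""
        return [
            Identifier("date", self.sent.isoformat()),
            self.sender,
            self.recipient,
            *self.mentions,
        ]


# Placeholder values only; nothing here is taken from a real letter.
letter = Letter(
    sent=date(1850, 1, 1),
    sender=Identifier("person", "writer-1"),
    recipient=Identifier("person", "reader-1"),
    body="Text of the letter",
    mentions=[Identifier("event", "dinner-1")],
)

for ident in letter.identifiers():
    print(ident.kind, ident.value)
```

Seen this way, the "text" is whatever you get by following those identifiers, which is what makes the idea of the text as a service plausible.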
Building data stores
Sunday, July 6th, 2008
Mats Dahlstrom’s talk at the Dilemmas of Digitization conference mentioned the paper Deep Sharing: A Case for the Federated Digital Library by David Seaman.
It would be great if there were a system for rapidly building small data stores from scratch to include texts, with editing software components and text encoding output (RDF and TEI, to share data easily electronically rather than expecting users to re-enter key fields such as bibliographic data).
Last weekend, I quickly hacked up a sample from Milton’s cry for free printing, the Areopagitica, and began to mark up some of the text in RDF. I think I’ve overegged the pudding, as it were, by adding SKOS (I was curious to see if you can adapt it to text documents, but Dublin Core is a better fit). As I am using just a few lines of text, I didn’t use the RDF API for PHP but hacked up a quick template with a database behind it. I’ll be looking to rewrite this at some point soon (as I will with the beginnings of an alternate spelling database to show that you could use SKOS to highlight any alternative spellings or misspellings in a text).
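The "quick template" approach to Dublin Core output might look something like the sketch below. It is written in Python rather than PHP, it is not the code behind the sample, and the subject URI under example.org is a made-up placeholder. Only the Dublin Core element namespace and the basic facts about the Areopagitica (title, author, 1644) are real.

```python
# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def turtle_literal(value):
    """Quote a string as a Turtle literal, escaping backslashes and quotes.

    Newlines and other control characters are not handled in this sketch.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_turtle(subject_uri, fields):
    """Render a dict of Dublin Core element names to values as Turtle."""
    pairs = [
        f"    dc:{name} {turtle_literal(value)}" for name, value in fields.items()
    ]
    lines = [
        f"@prefix dc: <{DC_NS}> .",
        "",
        f"<{subject_uri}>",
        " ;\n".join(pairs) + " .",
    ]
    return "\n".join(lines)


# Hypothetical subject URI; in practice the fields would come from the database.
doc = to_turtle(
    "http://example.org/texts/areopagitica",
    {"title": "Areopagitica", "creator": "John Milton", "date": "1644"},
)
print(doc)
```

A proper RDF library would be the safer choice for anything larger, since it takes care of escaping and serialisation formats, but for a handful of fields a template like this is enough to share bibliographic data without re-keying it.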
Spelunking text data
Sunday, July 6th, 2008
One of the ARTFL developers presented PhiloLogic and its PhiloMine extension. Both are free text-searching databases and tools. Both sets of code are designed for large sets of data, which does raise the question of whether it might be useful to develop a set of tools for smaller data holdings or individuals.