Posts Tagged ‘rdf’

BBC’s use of Semantic Web technology in World Cup

Tuesday, July 13th, 2010

Just caught this story on ReadWrite Web about the BBC website’s use of semantic web technology during the World Cup.  Jem Rayfield explains more on the BBC Internet blog about the use of technology.

I’ve still got a fair amount of reading to do but this is the sort of project that makes me rethink the Open Letters project and how it could be used by other sites. It has also given me food for thought for work as well.

Weeknotes: Data mining, XML and bibliographies

Sunday, May 23rd, 2010

It seems to have been a week of frantic completion and refactoring.

The first half was spent frantically converting HTML pages into PDFs using Verypdf’s HTMLtools server product. All in all the manual is very helpful and the test server could be set up quickly. It might have helped the other end if I’d remembered to break the file up for printing but that turned out to be a 10-minute job to put back into production. The next task is to transfer it from the test server onto the production one but that’ll need to wait for networking to tweak it a little.

I spent some time refactoring the call recordings archive. For some reason the archiving solution that I hacked up in November decided to start failing in March after it was changed. Despite being put back to its original state it never quite got back to working as it did. I’ve been trying to tweak it on and off but never found the time to complete it. I finally just made the time on Friday afternoon to look at it properly. I’d been thinking about item-based filtering after reading the first chapter of Toby Segaran’s Programming Collective Intelligence. (On the back of this, I think I’ll be buying his Beautiful Data at some point.) Although this is not really an intelligent programme as such, the techniques have shown some real promise in the hurried tests. Using a Redis datastore, the percentage of found recordings is way up. Fingers crossed for Monday morning when I can see how the scripts ran over the weekend. I also spent some time simplifying the matching algorithm so that I didn’t have to account for so many edge cases when dealing with time.

It seems that we are approaching some sort of real-time status update system at work. I’ve sort of been arguing for this for a while to remove the bottlenecks of having each system dependent on another one. One of our suppliers is sending us XML data so I’ve been playing with XPath 1.0 to extract the relevant values. (XPath 2.0 apparently isn’t directly supported by PHP; there might be a way of passing the data to Java but that adds unnecessary overhead.) Anyhow the core is running but I still need to fully test it and add in security.
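
As a rough illustration of the kind of extraction involved, here is a minimal sketch in Python rather than PHP. The supplier’s real format isn’t shown here, so the payload, the element names and the function name are all invented for the example. It relies on the XPath 1.0 subset that Python’s standard library understands.

```python
import xml.etree.ElementTree as ET

# Hypothetical supplier payload: element and attribute names are invented.
SAMPLE = """<updates>
  <job id="A1"><status>dispatched</status></job>
  <job id="A2"><status>delivered</status></job>
  <job id="A3"><status>dispatched</status></job>
</updates>"""


def job_ids_with_status(xml_text, status):
    """Return the id of every <job> whose <status> child equals `status`."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a subset of XPath 1.0, including the
    # [child='text'] predicate used here.
    return [job.get("id") for job in root.findall(f"./job[status='{status}']")]


dispatched = job_ids_with_status(SAMPLE, "dispatched")
print(dispatched)
```

The predicate does the filtering inside the path expression, so the calling code never has to loop over jobs it doesn’t care about.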

I’ve also been asked to design and implement a queueing system for the main internal server. I’ve run up a quick high-level overview but the detail still needs to be worked on. I’m pushing it back to June so that I can clear the decks of the older projects that are still on the board.

I had a chat with Jonathan Gray, a sound guy who does far too much, about digital humanities ideas. We’ve agreed to keep closer contact with each other about the area and to encourage each other into actually doing stuff (I have half a moleskin of ideas – time for more code, less talk then).  He proposed the Bibliographica idea in January and the team wrote a blog entry for the Open Knowledge Foundation blog. It is an idea that I’m looking forward to playing with and trying to embed data from. (http://bibliographica.org/)

One of the things that I’ve been thinking about, though, is that increasingly when we do research we store web pages, blog entries and so on. Whilst there is a way of recording these in a footnote (an “http://example.org accessed on <insert date>” type of thing), there does not appear to be a way of building a local archive of these with the relevant metadata for later retrieval. I don’t know about anybody else but I’ve got a fair few pages dotted around my hard drive for projects and I’d like a way of storing these properly and of integrating them into bibliographies or research notes. I know that there is the WARC format (Library of Congress link and the WARC tools Google Code project) to play with, so I need to make time to do that.
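
To make the idea a bit more concrete, here is a minimal Python sketch of what a local archive record might hold. This is not WARC, just a plain dictionary, and the field names and both function names are my own assumptions rather than any standard.

```python
import hashlib
import json
from datetime import date


def make_archive_record(url, html, accessed):
    """Bundle a saved page with the metadata a footnote would need."""
    return {
        "url": url,
        "accessed": accessed.isoformat(),
        # A checksum lets a later check confirm the stored copy is unchanged.
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "content": html,
    }


def footnote(record):
    """Format the record as an 'accessed on' style citation."""
    return f"{record['url']} (accessed on {record['accessed']})"


record = make_archive_record(
    "http://example.org/",
    "<html><body>Saved page</body></html>",
    date(2010, 5, 23),
)
print(footnote(record))
# Show the metadata only, leaving out the page body.
print(json.dumps({k: v for k, v in record.items() if k != "content"}, indent=2))
```

Keeping the access date and a checksum next to the content is what would let the same record feed both a bibliography entry and a later integrity check.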

I had a mini-hack on the Open Correspondence project last Sunday intending to update a couple of pages and got a little more done than that. The database needs rebuilding but the purl reference (http://purl.org/letter) now points to the schema. It is so close that I can’t wait to actually start hacking the data. Time to do the last little bits like tidy up the parser, use the weaving history API to embed a timeline and start using JENA, ARC and Chris Gutteridge’s Graphite library, which worked out of the box (but as yet I haven’t used it for much).

Goals for this week are to finish the Open Correspondence bits, update the trac instance with the various ‘todo’s, write a blog post for the Open Knowledge Foundation for Open Correspondence, do some major testing this week at work on various XML exports and imports. I should just be about caught up then. With any luck…

Exporting and querying Dickens data

Sunday, March 21st, 2010

As a follow-up to the posting regarding the proposed ontology, I’ve started to try and create a SPARQL endpoint. At some point soon, I want to use the new version of ARC as the version I’ve got here is a little out of date. After that the next thing should be to allow the endpoint’s results to be converted into other forms like JSON.

UPDATE: I’ve created an endpoint using the default ARC settings here: http://austgate.co.uk/dickens/endpoint.php
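
For anyone wanting to poke at it, a SPARQL endpoint normally takes the query as a URL-encoded `query` parameter on a GET request. Here is a minimal Python sketch that builds such a URL. It deliberately doesn’t contact the server, and the query is a generic one that assumes nothing about the vocabulary in the data.

```python
from urllib.parse import urlencode

ENDPOINT = "http://austgate.co.uk/dickens/endpoint.php"

# Generic query: list a few triples without assuming any vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(endpoint, sparql):
    """Return a GET URL carrying the query in the standard `query` parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

How to ask for a particular result format (JSON, say) depends on the endpoint’s configuration, so that is left out here.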

Creating the text ontology

Thursday, March 18th, 2010

I’ve been working quietly on ideas for an ontology to describe relationships in a letter from the correspondent to people referred to in the text. It is intended to complement and extend the Dublin Core and FOAF (Friend of a Friend) namespaces. Anyhow I’ve decided to publish a first set of thoughts on it having sat on the project for a while. I’ve sort of thought of it as using the text namespace in the text, which I am currently doing, but it is not set in stone.

Simple Ontology for Relationships in Texts

Text namespace

austgate.co.uk/ontology/text

Definition: An ontology which allows for the linking of text items, such as letters, together. It extends and complements Dublin Core (DC) and Friend of a Friend (FOAF).

Terms

Appearsin

The term is used to denote a work in which a character appears. For example:
Dear Alice,

As you may know I am coming to the end of the latest draft of the Ponsonby diaries. Bob Ponsonby is making his way across the marshes…

The character Bob Ponsonby could be referenced as text:Appearsin to denote his appearance in the work. This allows queries to find documents where the characters from a work appear, rather than just individual characters. It would usually be considered as a collection of text:Character references.

Character

A fictional person who is referenced in the text. This element is used to disambiguate between fictional and non-fictional characters. Non-fictional, i.e. real, people are denoted by foaf:Person. Character is a subset of foaf:Person and is intended for fictional people. For example, in a letter from an author to an agent, the author may be describing their latest project.

Dear Alice,

As you may know I am coming to the end of the latest draft of the Ponsonby diaries. Bob Ponsonby is making his way across the marshes…

In the example, Alice is a real person and could be denoted as such by using foaf:Person but Bob Ponsonby is equally a name and a person. Since he is fictional in this letter, he could be denoted as  text:Character in any RDF representation to allow users to link documents where the character is mentioned.

<text:character
rdf:about="http://austgate.co.uk/Dickens/characters/pickwick">
<foaf:name>Mr. Pickwick</foaf:name>
<text:appearsin
rdf:resource="http://austgate.co.uk/Dickens/works/pickwickpapers" />
</text:character>
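
To show how a description like this could be queried, here is a minimal Python sketch that reads the Pickwick example and indexes characters by the work they appear in. The rdf:RDF wrapper, the use of rdf:about to identify the character, the exact form of the text namespace URI (with the http:// prefix and trailing #) and the function name are all assumptions made so that the snippet parses as XML. A proper RDF library would be the better tool for real data.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"
TEXT_NS = "http://austgate.co.uk/ontology/text#"  # assumed form of the namespace

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:foaf="{FOAF_NS}" xmlns:text="{TEXT_NS}">
  <text:character rdf:about="http://austgate.co.uk/Dickens/characters/pickwick">
    <foaf:name>Mr. Pickwick</foaf:name>
    <text:appearsin rdf:resource="http://austgate.co.uk/Dickens/works/pickwickpapers" />
  </text:character>
</rdf:RDF>"""


def characters_by_work(rdf_xml):
    """Map each work URI to the names of the characters that appear in it."""
    root = ET.fromstring(rdf_xml)
    index = {}
    for character in root.findall(f"{{{TEXT_NS}}}character"):
        name = character.findtext(f"{{{FOAF_NS}}}name")
        for link in character.findall(f"{{{TEXT_NS}}}appearsin"):
            work = link.get(f"{{{RDF_NS}}}resource")
            index.setdefault(work, []).append(name)
    return index


print(characters_by_work(DOC))
```

This treats the RDF/XML purely as XML, which only works while every description follows the same layout as the example.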

Correspondent

This field denotes the correspondent of the letter. It is a subset of foaf:Person as it should denote a real person. (However it is perfectly possible for a fictional letter to be written and in this case it would perhaps be inappropriate to use foaf:Person).

textReferred

This refers to a text (book, verse or similar) which is referred to in the letter being serialised. It is intended to allow graphs to be built between letters that refer to the same text, showing what an author was doing or thinking about that text around the time of writing it or afterwards. It is designed to allow for some contextualisation of the referred work. It could also be used to build a reading list, or to trace possible influences or forgotten works that the author was aware of at the time.
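
The graph-building idea can be sketched in a few lines of Python. The letter identifiers and the mapping below are hypothetical data, not taken from the letters project. The point is only to show how textReferred values could link letters that mention the same text.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: letter identifiers mapped to the texts each one refers to.
REFERRED = {
    "letter-1": {"Pickwick Papers"},
    "letter-2": {"Pickwick Papers", "Oliver Twist"},
    "letter-3": {"Oliver Twist"},
}


def letters_sharing_texts(referred):
    """Return edges (letter_a, letter_b) -> set of texts both letters refer to."""
    # Invert the mapping so that each text lists the letters mentioning it.
    by_text = defaultdict(set)
    for letter, texts in referred.items():
        for text in texts:
            by_text[text].add(letter)
    # Every pair of letters under the same text becomes an edge in the graph.
    edges = defaultdict(set)
    for text, letters in by_text.items():
        for a, b in combinations(sorted(letters), 2):
            edges[(a, b)].add(text)
    return dict(edges)


for pair, texts in sorted(letters_sharing_texts(REFERRED).items()):
    print(pair, sorted(texts))
```

Sorting the letters before pairing them means each edge is stored once, whichever order the letters were loaded in.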

Work

The term denotes a type of text, in this case a book. It would be a collection of Dublin Core terms.
<text:work rdf:about="http://austgate.co.uk/dickens/work/pickwick">
<dc:title>Pickwick Papers</dc:title>
<dc:creator
rdf:resource="http://austgate.co.uk/dickens/people/CharlesDickens" />
<dc:publisher>Chapman and Hall</dc:publisher>
</text:work>

I’m still working on applying some of this to my letters project (which sort of came about from curiosity about the idea). Many thanks to Brian Matthews of the e-Science department of the STFC; any mistakes or oversights are entirely mine.