<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Aust Gate &#187; Text Mining</title>
	<atom:link href="http://austgate.co.uk/category/openknowledge/text-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://austgate.co.uk</link>
	<description>Open Knowledge and Literature</description>
	<lastBuildDate>Tue, 08 May 2012 20:33:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Thinking about texts and communities at Textcamp</title>
		<link>http://austgate.co.uk/2011/08/thinking-about-texts-and-communities-at-textcamp/</link>
		<comments>http://austgate.co.uk/2011/08/thinking-about-texts-and-communities-at-textcamp/#comments</comments>
		<pubDate>Sun, 14 Aug 2011 12:33:01 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[open_literature]]></category>
		<category><![CDATA[textcamp]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=378</guid>
		<description><![CDATA[Having gone to Textcamp yesterday, I started playing with Wordle and IBM&#8217;s Many Eyes at the suggestion of Dave Flanders of the JISC. As James Harriman-Smith, the organiser and Open Literature co-ordinator for the Open Knowledge Foundation, had suggested that this year is the anniversary of the manuscript of Alexander Pope&#8217;s An Essay on Criticism, [...]]]></description>
			<content:encoded><![CDATA[<p>Having gone to <a title="Textcamp on Open Literature" href="http://wiki.openliterature.net/Text_Camp_2011" target="_blank">Textcamp</a> yesterday, I started playing with Wordle and IBM&#8217;s Many Eyes at the suggestion of <a title="David Flanders JISC staff page" href="http://www.jisc.ac.uk/contactus/staff/davidfflanders" target="_blank">Dave Flanders</a> of the <a title="JISC website" href="http://www.jisc.ac.uk/" target="_blank">JISC</a>. As <a title="James Harriman-Smith's OKF page" href="http://okfn.org/members/jameshs/" target="_blank">James Harriman-Smith</a>, the organiser and Open Literature co-ordinator for the Open Knowledge Foundation, had suggested that this year is the anniversary of the manuscript of <a title="Wikipedia on Alexander Pope" href="http://en.wikipedia.org/wiki/Alexander_Pope" target="_blank">Alexander Pope</a>&#8217;s <a title="Wikipedia on Essay on Criticism" href="http://en.wikipedia.org/wiki/An_Essay_on_Criticism" target="_blank">An Essay on Criticism</a>, I popped the Gutenberg text into Wordle to see what it <a title="Wordle on Pope's Essay in Criticsm" href="http://www.wordle.net/show/wrdl/3912697/Essay_in_Criticism" target="_blank">shows as a tag cloud</a>. <a title="Wordle: Essay in Criticism" href="http://www.wordle.net/show/wrdl/3912697/Essay_in_Criticism"><img style="padding: 4px; border: 1px solid #ddd;" src="http://www.wordle.net/thumb/wrdl/3912697/Essay_in_Criticism" alt="Wordle: Essay in Criticism" align="left" /></a> The dominance of wit is not a surprise, as Wit in poetry was a prized quality for Pope and Dryden. There are some small issues, such as &#8216;still&#8217; and &#8216;Still&#8217; being counted separately. Perhaps this could be rectified by making everything lower case, but that presents other issues if two words are similar but the capitalisation suggests a different intonation.
As I&#8217;ve <a title="Post on Word clouds" href="http://austgate.co.uk/2010/10/tagging-the-revolution-exploring-edmund-burkes-reflections-on-the-revolution-in-france/" target="_blank">blogged before</a>, word clouds are great but not if they don&#8217;t link so, at some point in the future, I&#8217;ll sit down and actually upload a table to create a useful tag cloud. John Levin, of <a title="James Levin's blog onAnterotesis on Ecco" href="http://anterotesis.com/wordpress/2011/08/making-the-tcp-ecco-texts-accessible/" target="_blank">Anterotesis</a>, loaded a csv file of the recently released ECCO files. He loaded Volume Four of Defoe&#8217;s Tour of the Whole Island of Great Britain, which features Scotland.</p>
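<p>The lower-casing trade-off mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Wordle itself counts words; the function name <code>word_counts</code>, the <code>fold_case</code> flag and the sample sentence are all made up for the example.</p>

```python
import re
from collections import Counter


def word_counts(text, fold_case=False):
    """Count word occurrences, optionally treating 'Still' and 'still' as one word."""
    words = re.findall(r"[A-Za-z']+", text)
    if fold_case:
        words = [w.lower() for w in words]
    return Counter(words)


# Invented sample text (not a quotation from the Essay).
sample = "Still the wit remains, and still it charms; Wit is prized."

case_sensitive = word_counts(sample)
case_folded = word_counts(sample, fold_case=True)

print(case_sensitive["still"], case_sensitive["Still"])  # counted separately
print(case_folded["still"])                              # merged into one entry
```

<p>Keeping case folding as an option rather than the default leaves room for the cases where capitalisation carries meaning, such as Wit as a poetic quality.</p>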
<div id="attachment_383" class="wp-caption alignleft" style="width: 190px"><a href="http://austgate.co.uk/wp-content/uploads/2011/08/oenvq.jpg"><img class="size-medium wp-image-383" title="Wordcloud of Defoe's journey" src="http://austgate.co.uk/wp-content/uploads/2011/08/oenvq-180x300.jpg" alt="Wordcloud of Defoe's journey taken at Textcamp by Dave Flanders" width="180" height="300" /></a><p class="wp-caption-text">Wordcloud of Defoe&#39;s journey taken at Textcamp</p></div>
<p>Using the Many Eyes Word Cloud, we can see that Scotland is unsurprisingly the largest item, but Lord and Earl are also popular, suggesting that he stopped with or met the aristocracy rather than just travelling randomly. Dave Flanders and John created some cool visualisations using the tool, which allow you to follow words in the text and to see which words are most often linked to them (using bigrams, I would suppose) in a tree fashion. It is certainly something that I will be looking up later for &#8220;quick win&#8221; visualisations.</p>
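<p>If the tree view really is driven by bigrams, which is only my guess, then the underlying counting might look something like the sketch below. The function name <code>following_words</code> and the sample sentence are invented for illustration; I do not know how Many Eyes actually builds its trees.</p>

```python
from collections import Counter, defaultdict


def following_words(text):
    """Map each word to a Counter of the words that directly follow it."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    # Pair every word with its successor to form the bigrams.
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return followers


# Invented sample text, standing in for a passage of the Tour.
sample = "the earl of mar met the lord of the isles and the earl of bute"
tree = following_words(sample)

# The most common continuations of a word form one level of the tree.
print(tree["the"].most_common())
print(tree["earl"].most_common())
```

<p>Repeating the lookup on each follower in turn would give the next level of branches.</p>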
<p>One of the intriguing projects that was suggested was building our own DIY bookscanner using links currently stored on the <a title="DIY Bookscanner" href="http://wiki.openliterature.net/Tcamp11/DIYD" target="_blank">Textcamp 2011 wiki pages</a>. I think that Dave Flanders might be organising a hack weekend to actually build the machine for real use. I find it interesting, and I think it would be cool to also see whether one can be built at home or using iPhone / Android phones, which would also entail a software hack unless an app already exists. That is something to explore later.</p>
<p>Mark MacGillivray, of OKFN and <a title="Cottage Labs" href="http://cottagelabs.com/" target="_blank">Cottage Labs</a>, and Brian Hole of <a title="Ubiquity Press" href="http://www.ubiquitypress.com/" target="_blank">Ubiquity Press</a>, spoke about Open Access and making scholarship open while retaining its rigour. Using Open Access, we should be able to share the data, the ways of interpreting it and the final interpretation which is published.</p>
<p>The science community has been doing this for some while and things like the Panton Principles and Science Commons are showing the way. One of the ideas was to write a handbook on how to use openness in literature, which is something that we need to address and build on. We ought to write an open guide / manual and build on / develop the Panton Principles where necessary as a core set of principles to work with.</p>
<p>Days like Textcamp and Book Hackday are extremely useful for thinking about this and working on the ideas. It is easy to get into echo chambers of mailing lists and blogs; we need these events to meet new people, be challenged to explain ourselves and to either build on the day or go away with ideas to test and try out. The day has got me excited about using word clouds again and doing a bit more work on them to make them useful as a tool. It has also got me excited about book scanning and doing some hardware hacking (which I&#8217;ve not really done before).</p>
<p>Running the Pope essay through Wordle makes me excited about testing what we can do with the ECCO TEI documents that John Levin links to. Can we hyperlink to other texts, authors and events that are mentioned in them (not just with the annotator tool but in generated HTML), or use HTML 5 to embed audio links to further discussions or pronunciation (for example Byron&#8217;s Don Juan, which has been argued to be pronounced &#8220;Jew-an&#8221; rather than &#8220;Hwan&#8221;, and the arguments for and against)?</p>
<p>Perhaps that gets to one of the issues that arose in the break-out discussions in the kitchen. After the lightning talk about digital publishing, there seemed to be an argument about whether current digital publishing was really pushing the boundaries or flailing around. I do think that it has some real benefits for niche publishing but these have not been fully explored. The model will need to change and perhaps become more open in those senses, perhaps linking the raw data to the interpretation earlier to allow the relevant community to peer review the data earlier. Just a suggestion. There are two distinct communities: the top-down business layer and the grass roots layer of activists, data developers and so on. Both would appear to have broadly similar aims, but the question is how to put them together in a useful way so that both can learn. Don&#8217;t get me wrong here, as I believe I&#8217;m at the grass roots layer, but I think that both sides could have a dialogue which gets around the issue that the music and film industries have found themselves in, i.e. confrontation. We are here to disrupt and make because we are passionate. We care about the industry. Publishing is an industry which needs to change and transform itself. Put the two together and there are ways of moving forward. My hope is that we could get some more publishers along to future events.</p>
<p>The other important thing is that these conversations carry on afterwards. The round table discussions were great, as were the break-out ones in the kitchen, but they need to carry on or we create our own echo chamber, which reduces the value of what happened yesterday.</p>
<p>Whilst I did not do as much coding as I wanted to yesterday, I met some new people and caught up with colleagues. The fact that organisations such as JISC are supporting events like this shows their underlying importance and use to the community. We&#8217;ve started, now we need to carry on by chatting, blogging, sharing and doing more of these events.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/08/thinking-about-texts-and-communities-at-textcamp/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Open Correspondence toolkit and converting XML into JSON</title>
		<link>http://austgate.co.uk/2011/05/weeknotes-open-correspondence-toolkit-and-converting-xml-into-json/</link>
		<comments>http://austgate.co.uk/2011/05/weeknotes-open-correspondence-toolkit-and-converting-xml-into-json/#comments</comments>
		<pubDate>Thu, 26 May 2011 19:25:47 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[weeknotes]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=342</guid>
		<description><![CDATA[I&#8217;ve been quiet for a bit though generally because I&#8217;ve been quite busy on projects and exploring ideas. After Book Hackday, I&#8217;ve written a post about beginning to develop the Open Correspondence toolkit for the Open Knowledge Foundation&#8217;s Notebook blog. I was also contacted regarding converting the TEI XML pages into JSON, which I am [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been quiet for a bit though generally because I&#8217;ve been quite busy on projects and exploring ideas.</p>
<p>After Book Hackday, I&#8217;ve written a post about <a title="Open Correspondence toolkit" href="http://notebook.okfn.org/2011/05/25/mining-the-personal-using-open-correspondence-to-explore-correspondents/" target="_blank">beginning to develop the Open Correspondence toolkit</a> for the Open Knowledge Foundation&#8217;s Notebook blog. I was also contacted regarding converting the TEI XML pages into JSON, which I am currently working on.</p>
<p>Once I&#8217;ve done some more work on it, I&#8217;ll post the code and more about it.</p>
<p>I&#8217;ve been working on another project which may or may not be open. It is certainly interesting but I am not sure I can say much more than that. I hope to have a blog post up soon about it but I am rather excited by it and its possibilities.</p>
<p>Meanwhile, the work project continues apace with some surprising outcomes for me. After watching a video on Facebook&#8217;s architecture, I&#8217;m beginning to see certain parts very differently. I really do hope to write more on this but I&#8217;ve got some building to do and a bit more delving and reading that needs completion.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/05/weeknotes-open-correspondence-toolkit-and-converting-xml-into-json/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Marking up Open Correspondence with TEI XML</title>
		<link>http://austgate.co.uk/2011/03/marking-up-open-correspondence-with-tei-xml/</link>
		<comments>http://austgate.co.uk/2011/03/marking-up-open-correspondence-with-tei-xml/#comments</comments>
		<pubDate>Sun, 20 Mar 2011 11:03:26 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=303</guid>
		<description><![CDATA[As part of the next version of Open Correspondence, I&#8217;ve been working on the XML and JSON mark-up. As part of the XML, I&#8217;ve been using the TEI mark-up for the letters. I once heard this described as &#8220;XML for people who don&#8217;t think XML is flexible enough&#8221;. Now I can see why. It is [...]]]></description>
			<content:encoded><![CDATA[<p>As part of the next version of <a title="Open Correspondence site" href="http://www.opencorrespondence.org" target="_blank">Open Correspondence</a>, I&#8217;ve been working on the XML and JSON mark-up.</p>
<p>As part of the XML, I&#8217;ve been using the <a title="TEI P5 XML mark-up" href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html" target="_blank">TEI mark-up</a> for the letters. I once heard this described as &#8220;XML for people who don&#8217;t think XML is flexible enough&#8221;. Now I can see why. It is a highly flexible solution to digitising texts but can be confusing, especially when switching between versions. I believe the original model that I had been working on was P4 but the current one is P5, so I had to negotiate that change and make sure that I had the correct elements in the blocks. Even then, there can be two or three different versions of the same element in a section, and I do have to wonder about the wisdom of that, rather than simplifying the elements and leaving extensible elements that may or may not be used. I&#8217;m intending to use the schema again and to really get my head around it rather than tinkering on the edges.</p>
<p>I&#8217;ve attempted this conversion before but think that I&#8217;ve finally got it to a point which is nearly there. What I would really like to do is to put together some sort of tool kit as a core to the Open Correspondence project. Clearly this would be a long-term project and would need more research but it might be useful to other projects.</p>
<p>As well as marking up texts, it would be useful to use the XML mark-up to convert the text into other formats such as Mobipocket or the Kindle formats to allow a user to create their own e-publication. It would also be useful to find a way of using the XML in conjunction with the <a title="psbook command pages" href="http://www.tardis.ed.ac.uk/~ajcd/psutils/psbook.html" target="_blank">psbook</a> command to create a print version of a letter or collection. This does mean that I need to convert the XML into a PostScript file (which raises a host of questions at the moment &#8211; such as converting structured format into layout format) and then print it.</p>
<p>I&#8217;ve also been playing around with the correspondent collections and the way of marking up collections in TEI. I had thought of this as working on creating printable collections and making the data re-usable for printing. Equally it might allow the data to be used in answer to Jonathan Gray&#8217;s question regarding identifying the letters written to a particular correspondent.</p>
<p>When I can get the XML working and validated, then I&#8217;ll look at the JSON output. It would draw a line under this part of the project and allow me to move on. I&#8217;m aiming for a release towards the end of March or middle of April, in keeping with a six-week schedule.</p>
<p>The next thing after that is to begin answering Jonathan&#8217;s questions in terms of a tool kit to identify weaknesses and to try and write some code to re-use and re-mix the data. I would hope that would be in the next release towards the end of May.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/03/marking-up-open-correspondence-with-tei-xml/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finding and mapping influences</title>
		<link>http://austgate.co.uk/2011/03/finding-and-mapping-influences/</link>
		<comments>http://austgate.co.uk/2011/03/finding-and-mapping-influences/#comments</comments>
		<pubDate>Wed, 16 Mar 2011 18:49:12 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[letters]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[rdf]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=307</guid>
		<description><![CDATA[The awesome Jonathan Gray posted an intriguing question on his blog about mapping influence in intellectual history. What he is trying to do is to map the possible routes of influence between people. In his case, it is philosophers; in mine, authors. One of the driving ideas behind the Open Correspondence RDF was to begin [...]]]></description>
			<content:encoded><![CDATA[<p>The awesome Jonathan Gray posted an intriguing question on his blog about <a title="Jonathan Gray on mapping intellectual history and influence" href="http://jonathangray.org/2011/02/20/who-read-what-mapping-influence-in-intellectual-history/" target="_blank">mapping influence in intellectual history</a>. What he is trying to do is to map the possible routes of influence between people. In his case, it is philosophers; in mine, authors.</p>
<p>One of the driving ideas behind the <a title="Open Correspondence RDF schema" href="http://www.opencorrespondence.org/schema" target="_blank">Open Correspondence RDF</a> was to begin identifying the people to whom Dickens wrote about books. Out of this I would like to create some visualisations of the data. You could possibly do this for the places, for example track his letters for one of the US tours.</p>
<p>But back to the original question. I believe this can be done (as I&#8217;ve been working on the XML issues) using Python&#8217;s rdflib. The major issue would be to get this working across Python versions 2.4 and 3 so that any released code would run on either.</p>
<p>Jonathan: as an open call, I&#8217;d love to work with you on this.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/03/finding-and-mapping-influences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adding linguistic interfaces to Open Correspondence</title>
		<link>http://austgate.co.uk/2011/03/adding-linguistic-interfaces-to-open-correspondence/</link>
		<comments>http://austgate.co.uk/2011/03/adding-linguistic-interfaces-to-open-correspondence/#comments</comments>
		<pubDate>Wed, 09 Mar 2011 11:18:59 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[open_correspondence]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=301</guid>
		<description><![CDATA[I&#8217;ve been playing around with the Python NLTK package, in particular the WordNet interface. WordNet is hosted by Princeton University. I mentioned that I was going to look at this and the idea of allowing a search for lemmas of a word. It came about from a question posed on the Open Literature mailing list regarding [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been playing around with the Python <a title="Python NLTK package website" href="http://www.nltk.org/" target="_blank">NLTK</a> package, in particular the <a title="NLTK WordNet interface" href="http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html" target="_blank">WordNet interface</a>. <a title="WordNet lexical database" href="http://wordnet.princeton.edu/" target="_blank">WordNet</a> is hosted by Princeton University. I mentioned that I was going to look at this and the idea of allowing a search for lemmas of a word. It came about from a question posed on the Open Literature mailing list regarding whether you could search it with lemmas.</p>
<p>Xapian does word stemming but not lemmas, which are slightly different. In stemming, the word production should appear as produc*, since produc is treated as the base of the word. However, that is not a real word. The base of the word is produce, which is what the WordNet lemma returns.</p>
<p>Using the API notes, I&#8217;ve been playing around with the following:</p>
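<blockquote><p>from nltk.corpus import wordnet as wn</p>
<p># example search term<br />
word = "author"<br />
# a set comprehension keeps each lemma name once across all synsets<br />
# (lemmas() and name() are methods in NLTK 3; older releases exposed them as attributes)<br />
word_lem = {lemma.name() for synset in wn.synsets(word) for lemma in synset.lemmas()}</p>
<p>ret_lem = list(word_lem)</p></blockquote>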
<p>Having used set to remove any duplicates, I can return the list of the lemmas that WordNet gives. You have to go through a <a title="Wikipedia on Synsets" href="http://en.wikipedia.org/wiki/Synsets" target="_blank">Synset</a> if you don&#8217;t have the exact part of speech of a word (verb, adverb, adjective or noun), since the lemma constructor requires it to produce the lemma. That&#8217;s fine, and I can return the names using lemma.name, but the part of speech is held in the synset and I&#8217;m not sure how to retrieve it. It would be useful to send back so that a user can see the part of speech and determine whether it is of interest or not.</p>
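<p>One possible way of sending the part of speech back alongside each lemma is sketched below. It assumes NLTK 3, where a synset exposes its part of speech through <code>pos()</code> (returning tags such as n, v, a, s or r). The helper names <code>lemmas_with_pos</code> and <code>demo</code> are made up for this sketch and are not part of NLTK or of the site&#8217;s code.</p>

```python
def lemmas_with_pos(synsets):
    """Return a sorted list of (lemma name, part-of-speech tag) pairs.

    Works on any objects that offer pos() and lemmas(), and whose lemmas
    offer name(), which is the interface NLTK 3 WordNet synsets provide.
    """
    pairs = set()  # a set removes duplicate (name, pos) combinations
    for synset in synsets:
        for lemma in synset.lemmas():
            pairs.add((lemma.name(), synset.pos()))
    return sorted(pairs)


def demo(word="author"):
    """Print the pairs for one word; needs NLTK 3 and the WordNet corpus installed."""
    from nltk.corpus import wordnet as wn  # imported here so the helper above has no hard dependency
    for name, pos in lemmas_with_pos(wn.synsets(word)):
        print(name, pos)
```

<p>Calling <code>demo()</code> prints one line per lemma and part of speech, which is roughly the shape an API response could take.</p>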
<p>In the first instance though, I can return the related synsets to the user through an API, yet to be written, and link them to the Xapian search so that they can search for the term if of interest. It begins the opening up of the letters as a linguistic dataset since the tone and language might vary across the letters depending on the correspondent. One would expect letters to his family to be less formal than to a business colleague or fellow author. I&#8217;m aiming to have an early draft up shortly with some improved XML and JSON handling for the individual letters.</p>
<p>Given that I really did not do that well in the linguistics module at the University of Leicester, I&#8217;m surprised that this has been the first API module to be developed. It makes sense though, but I need to find a way of getting back to the original purpose of the site.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/03/adding-linguistic-interfaces-to-open-correspondence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Open Correspondence updates</title>
		<link>http://austgate.co.uk/2011/03/weeknotes-open-correspondence-updates/</link>
		<comments>http://austgate.co.uk/2011/03/weeknotes-open-correspondence-updates/#comments</comments>
		<pubDate>Tue, 08 Mar 2011 10:01:37 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[mapping]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[timelines]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=298</guid>
		<description><![CDATA[I&#8217;ve bitten the bullet and done it. I&#8217;ve uploaded the current changes to the Open Correspondence site. The current changes are: additional fields in the RDF endpoint.  I still need to do some major work to JSON and XML which I hope to do for the next update. a basic text search a basic set [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve bitten the bullet and done it. I&#8217;ve uploaded the current changes to the Open Correspondence site.</p>
<p>The current changes are:</p>
<ul>
<li> additional fields in the RDF endpoint.  I still need to do some major  work to JSON and XML which I hope to do for the next update.</li>
</ul>
<ul>
<li>a basic text search</li>
</ul>
<ul>
<li>a basic set of geographic data in the collection</li>
</ul>
<ul>
<li> better linking from the letters to the correspondent and geographical  data (NB it is still incomplete)</li>
</ul>
<ul>
<li> some mapping with <a title="Open Layers Javascript mapping website" href="http://openlayers.org/" target="_blank">Open Layers</a> javascript.</li>
</ul>
<ul>
<li> a <a title="Simile timeline " href="http://www.simile-widgets.org/timeline/" target="_blank">Simile</a> timeline (which is a bit slow at the moment).</li>
</ul>
<p>Admittedly some of this is exposing work that was already there but hidden. However, I&#8217;ve also been working on some Unicode fixes to the underlying XML used by the project, which has meant rebuilding the tables and the Xapian indexes.</p>
<p>Following a request on the Open Literature mailing list, I&#8217;m looking at the idea of using Python&#8217;s <a title="Python Natural Language Toolkit" href="http://www.nltk.org/" target="_blank">NLTK</a> to create some linguistic API wrappers around the Xapian search. It strikes me that these letters can be used to create a corpus of Dickens&#8217;s language where you can explore the language used in family correspondence (to his daughters and wife), to other authors (Wilkie Collins) and to readers. That is a longer project though in terms of building the relevant indexes.</p>
<p>I&#8217;m also looking at the idea of clustering a collection of letters to a correspondent and seeing what happens (for some reason, the current script is looking at Wilkie Collins). There is also a set of queries that one might run against letters discussing books and the publication dates to view the distribution. I&#8217;m working on these latter questions at the moment for intended release later this week but I do foresee it being delayed a while.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/03/weeknotes-open-correspondence-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Arts funding, Open Correspondence</title>
		<link>http://austgate.co.uk/2011/01/weeknotes-arts-funding-open-correspondence/</link>
		<comments>http://austgate.co.uk/2011/01/weeknotes-arts-funding-open-correspondence/#comments</comments>
		<pubDate>Sun, 16 Jan 2011 20:44:33 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[arts_funding]]></category>
		<category><![CDATA[linked_data]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=276</guid>
		<description><![CDATA[I&#8217;ve been doing some updating this week rather than anything new. I was going to spend time trying to complete the places section of the Open Correspondence website. It needs some tidying up as the endpoint has had some changes made to it. I did come across an issue which has implications in exposing other [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been doing some updating this week rather than anything new. I was going to spend time trying to complete the <a title="Open Correspondence places index" href="http://www.opencorrespondence.org/place/" target="_blank">places section of the Open Correspondence</a> website. It needs some tidying up as the endpoint has had some changes made to it. I did come across an issue which has implications in exposing other pieces of metadata, such as people who are being referred to.</p>
<p>Firstly, I need to work out a more exact way of mapping the data in the database or flat file. I think what I really need is to use something like:</p>
<ul>
<li>place</li>
<li>address</li>
<li>city</li>
<li>latitude</li>
<li>longitude</li>
<li>description</li>
<li>url</li>
</ul>
<p>The data that I have is not quite as granular as this. Yet. When I&#8217;ve done this, I need to build the mapping so that if a place is entered, say <a title="Wikipedia page on Hotel Meurice Paris" href="http://en.wikipedia.org/wiki/H%C3%B4tel_Meurice" target="_blank">Hotel Meurice, Paris</a>, then I can return the details and latitude / longitude to render an Open layers map. That&#8217;s almost the easiest bit really.</p>
<p>The second issue is the difference in names. Over time and in the heat of writing, names can change subtly. Take, for instance, <a title="Wikipedia page on Gads Hill Place" href="http://en.wikipedia.org/wiki/Gads_Hill_Place" target="_blank">Gads Hill Place</a>, one of Dickens&#8217;s homes, which is now a school. In the letters it is referred to as:</p>
<ol>
<li>Gad&#8217;s Hill Place,</li>
<li>Gad&#8217;s Hill Place, Higham</li>
<li>Gad&#8217;s Hill</li>
</ol>
<p>It can also be known as Gadshill Place or Gads Hill Place. I need to find a way of reconciling the terms. Firstly, I need to develop a way of checking inside a term and then returning it as it is if it is a new term, or returning the mapped version if it matches a known term. Secondly, I need to fuzzy match the strings so that any near differences (using the <a title="Levenshtein edit distance code" href="http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python" target="_blank">Levenshtein edit distance</a>) can be checked and either ignored or used to exclude the term.</p>
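<p>The two steps above could be sketched roughly as follows. This is a minimal illustration and not the parsing library&#8217;s code: the function names, the list of known places and the distance threshold of 2 are all assumptions chosen for the example.</p>

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(name):
    """Lower-case and keep only letters and digits, so "Gad's Hill" becomes "gadshill"."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def canonical_place(name, known_places, max_distance=2):
    """Return the mapped version of `name`, or `name` itself if it looks new."""
    key = normalise(name)
    # Step 1: check inside the term, e.g. "gadshill" within "gadshillplace".
    for place in known_places:
        target = normalise(place)
        if key and (key in target or target in key):
            return place
    # Step 2: fuzzy match on edit distance to catch near misses.
    for place in known_places:
        if levenshtein(key, normalise(place)) <= max_distance:
            return place
    return name


# Hypothetical list of canonical names for the example.
known = ["Gads Hill Place", "Hotel Meurice, Paris"]
for variant in ["Gad's Hill Place, Higham", "Gad's Hill", "Gadshill Place", "Tavistock House"]:
    print(variant, "->", canonical_place(variant, known))
```

<p>The containment check is deliberately crude and would wrongly merge very short names, so a real version would probably need a minimum length or a manual review step before accepting a match.</p>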
<p>These issues will also affect the correspondent code which is being created. I suspect that anything with names will have the same issues. For instance, Wilkie Collins is known in the letters as <a href="http://opencorrespondence.org/correspondent/view/Mr%20W%20Wilkie%20Collins">Mr W Wilkie Collins</a> and <a href="http://opencorrespondence.org/correspondent/view/Mr%20Wilkie%20Collins">Mr Wilkie Collins</a>. In the current implementation of the site, these are two different entities, which is clearly wrong. They are the same entity but there is a subtle difference which is not accounted for.</p>
<p>So to deal with this, I am going back to the parsing library and building these in instead. Whilst it is a slower way of dealing with these issues, it provides a chance to do any necessary rethinking of the information and the site.</p>
<p>As part of this, I downloaded some <a title="TEI website" href="http://www.tei-c.org/index.xml" target="_blank">TEI </a>guidelines from the <a title="TEI Guidelines on California Digital Library" href="http://www.cdlib.org/groups/stwg/index.html" target="_blank">California Digital Library</a> to use to build the base metadata export. Ideally what I&#8217;m hoping to do is to create the data as a Python dictionary and then reformat into HTML, HTML &amp; RDFa, RDF, JSON or XML. It should allow me to export the same data for each type.</p>
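<p>As a rough sketch of the dictionary-first idea, the snippet below holds one record as a Python dictionary and serialises it to JSON and to a simple XML fragment using only the standard library. The function names and the <code>place_record</code> element are invented for the example, and the XML here is plain element-per-field output, not TEI.</p>

```python
import json
import xml.etree.ElementTree as ET

# One record held as a plain dictionary; only fields with known values are filled in.
place = {
    "place": "Hotel Meurice",
    "city": "Paris",
    "url": "http://en.wikipedia.org/wiki/H%C3%B4tel_Meurice",
}


def to_json(record):
    """Serialise the record as JSON with keys in a stable order."""
    return json.dumps(record, sort_keys=True)


def to_xml(record, root_tag="place_record"):
    """Serialise the record as XML, one child element per dictionary key."""
    root = ET.Element(root_tag)
    for key in sorted(record):
        child = ET.SubElement(root, key)
        child.text = str(record[key])
    return ET.tostring(root, encoding="unicode")


print(to_json(place))
print(to_xml(place))
```

<p>Because every format is generated from the same dictionary, adding a field in one place should carry through to each export.</p>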
<p>I&#8217;m sure at times I&#8217;ll wonder what I started but it needs doing if the site is to accept more authors. After that, back to search.</p>
<p>On a separate note, I have also done some work on the <a title="Arts funding search" href="http://austgate.co.uk/development/search_arts.php" target="_blank">Arts Funding search</a>. I&#8217;ve given it a re-skin and used the <a title="jQuery accordion widget" href="http://jqueryui.com/demos/accordion/" target="_blank">Accordion widget</a> from jQuery UI. It also has some more search options built in so that the data can be searched by date and amount as well as political constituency and art form. The search still needs to accept comparison arguments such as &lt;, &gt; or equals on the amount, but that can come. I&#8217;ve been reading <a title="Jenni Tennison on Linked data on data.gov.uk" href="http://data.gov.uk/blog/guest-post-developers-guide-linked-data-apis-jeni-tennison" target="_blank">Jeni Tennison&#8217;s post</a> on the data.gov.uk blog to work out how best to expose the data using Linked Data.</p>
<p>Writing this post, it occurs to me that whilst Linked Data is an awesome way of exposing data, useful search is still an important part of any content-driven website. As blogged before, I have implemented an early version of a Xapian search. As Tim Bray has noted, advanced search might have a smaller audience but it is more likely to be used by the heavier users, so it deserves to have time taken on it.</p>
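<p>The amount comparison could be sketched along these lines. This is in Python purely for illustration, and it assumes each grant is a dictionary with a numeric <code>amount</code> key; the names are hypothetical and not taken from the search code.</p>

```python
import operator

# Map the symbols a user might type to the matching comparison function.
COMPARISONS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def filter_by_amount(grants, symbol, amount):
    """Keep the grants whose 'amount' satisfies the chosen comparison."""
    compare = COMPARISONS[symbol]
    return [grant for grant in grants if compare(grant["amount"], amount)]
```

<p>Keeping the symbols in a lookup table means an unsupported operator fails with a <code>KeyError</code> instead of being passed through to the query.</p>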
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/01/weeknotes-arts-funding-open-correspondence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Contextualising places in time</title>
		<link>http://austgate.co.uk/2010/11/contextualising-places-in-time/</link>
		<comments>http://austgate.co.uk/2010/11/contextualising-places-in-time/#comments</comments>
		<pubDate>Sun, 21 Nov 2010 16:46:15 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[linked_data]]></category>
		<category><![CDATA[mapping]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[place_names]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=241</guid>
		<description><![CDATA[As part of the Open Correspondence project, I&#8217;ve started to look at place names and locations to build a set of temporal and spatial data for the letters to allow for geographical queries. As part of the search, I came across a reference to Sean Gillies&#8217; useful blog post talking about modelling historical place names [...]]]></description>
			<content:encoded><![CDATA[<p>As part of the Open Correspondence project, I&#8217;ve started to look at place names and locations to build a set of temporal and spatial data for the letters to allow for geographical queries.</p>
<p>As part of the search, I came across a reference to Sean Gillies&#8217; useful blog post talking about <a title="Sean Gillies on historical placenames" href="http://sgillies.net/blog/1032/modeling-historical-places-for-pleiades/" target="_blank">modelling historical place names</a> for the Pleiades project. What intrigues me about the places is that they don&#8217;t exist in amber. They change and adapt.</p>
<p>Playing around with Open Layers (and inspired by Jo Walsh&#8217;s <a title="Jo Walsh's Mapping Hacks on historical maps" href="http://mappinghacks.com/2010/03/21/a-re-education-in-openstreetmap/" target="_blank">piece on historical maps on Mapping Hacks</a>), I&#8217;ve become interested in the idea of placing a historical map on top of a current street map so that you can see what a place looks like now and also when, for example, Dickens lived in <a title="Wikipedia on Tavistock Square" href="http://en.wikipedia.org/wiki/Tavistock_Square" target="_blank">Tavistock Square</a> or <a title="Wikipedia on Gad's Hill Place" href="http://en.wikipedia.org/wiki/Gads_Hill_Place" target="_blank">Gad&#8217;s Hill Place</a>. How has it changed? What did it look like then? What does it look like now? Does it even exist?</p>
<p>Whilst that may not aid textual analysis, it could be tied into historical and social queries about the letters. By adding this layer of data, which one might not normally think about in terms of letters data, we can find out other things of interest.</p>
<p>For now, I think I&#8217;ll try not to go too far down this road, but only so that I can get the other bits of data fixed first.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/11/contextualising-places-in-time/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Books and places for Open Correspondence</title>
		<link>http://austgate.co.uk/2010/11/weeknotes-books-and-places-for-open-correspondence/</link>
		<comments>http://austgate.co.uk/2010/11/weeknotes-books-and-places-for-open-correspondence/#comments</comments>
		<pubDate>Sun, 21 Nov 2010 12:54:36 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[places]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=246</guid>
		<description><![CDATA[Progress on the next version of Open Correspondence has been a bit slower than I would have liked. Sleep is, however, useful for being alert enough to write code. I&#8217;ve gone back to some of the work that I was doing for the first version of the site way back last year. As part [...]]]></description>
			<content:encoded><![CDATA[<p>Progress on the next version of Open Correspondence has been a bit slower than I would have liked. Sleep is, however, useful for being alert enough to write code.</p>
<p>I&#8217;ve gone back to some of the work that I was doing for the first version of the site way back last year. As part of the move to Linked Data, I&#8217;ve been working on a URI for places and books. Places, as noted in previous posts, has come together and is just in need of some tidying up. I&#8217;ve managed to create an index page from the RDF endpoint using rdflib to parse the triples looking for the geo: namespace and then putting the items into a set to remove the duplicates. This needs changing as sets are unordered and I&#8217;d like the page to be ordered so that a place can be found quickly. Perhaps a better option would be to place the raw data into a dictionary and cast to a list to sort at the last moment (or more simply sort the keys in the dictionary&#8230;) and then to remove the duplicates such as Gad&#8217;s Hill, which is the same place as Gadshill. Both are used but refer to the same entity, so I need to do a difference on the strings (probably using difflib or a variant) to identify the changes and clean up the URIs.</p>
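<p>A small sketch of the ordering and de-duplication step, using difflib from the standard library on a plain list of names (the rdflib parsing is left out). The function names and the 0.85 cutoff are assumptions and would need tuning against the real place names.</p>

```python
import difflib


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_place_index(names, cutoff=0.85):
    """Return the sorted canonical names and a map from each spelling to its canonical name."""
    mapping = {}
    canonical = []
    for name in names:
        if name in mapping:
            continue  # exact duplicate, already handled
        match = next((c for c in canonical if similarity(name, c) >= cutoff), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return sorted(canonical), mapping
```

<p>The first spelling seen is kept as the canonical one, so the order in which the triples arrive decides which variant ends up in the URI. A hand-maintained list of preferred spellings may turn out to be the safer choice.</p>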
<p>With the books, I had created a table of the publication dates and the titles, so all I need to do is to map each book&#8217;s variant titles: for example, &#8220;The Adventures of Nicholas Nickleby&#8221; is better known as &#8220;Nicholas Nickleby&#8221; or plain &#8220;Nickleby&#8221; in the letters. It might be easiest to put this into a dictionary at the moment rather than another table and to call that. I would also need to get some sort of introduction for each book (and perhaps in the future create an Open Dickens site for the novels).</p>
<p>I&#8217;m sure I can do this and get it working in a few hours. I must make the time now that I&#8217;ve had a small break.</p>
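<p>The dictionary approach might look something like this. The two entries come from the example above; the names <code>TITLE_VARIANTS</code> and <code>canonical_title</code> are hypothetical.</p>

```python
# Variant titles as they appear in the letters, mapped to the full title.
TITLE_VARIANTS = {
    "nicholas nickleby": "The Adventures of Nicholas Nickleby",
    "nickleby": "The Adventures of Nicholas Nickleby",
}


def canonical_title(title):
    """Return the full title for a known variant, else the tidied title unchanged."""
    cleaned = " ".join(title.split())  # collapse stray whitespace
    return TITLE_VARIANTS.get(cleaned.lower(), cleaned)
```

<p>Unknown titles pass through untouched, so the dictionary can be filled in gradually as new variants turn up in the letters.</p>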
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/11/weeknotes-books-and-places-for-open-correspondence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Humanities and building data sets</title>
		<link>http://austgate.co.uk/2010/11/digital-humanities-and-building-data-sets/</link>
		<comments>http://austgate.co.uk/2010/11/digital-humanities-and-building-data-sets/#comments</comments>
		<pubDate>Thu, 18 Nov 2010 20:22:53 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[digital_humanities]]></category>
		<category><![CDATA[mapping]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=244</guid>
		<description><![CDATA[Rob Myers reposted this New York Times link on the Open Knowledge Foundation discussion list about Digital Humanities and its growth. It mentions the Mapping the Republic of Letters project (unfortunately it does not appear to be open) and its linking together of the centres of letter production. Last night I managed to build the [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Rob Myers website" href="http://robmyers.org/" target="_blank">Rob Myers</a> reposted this <a title="NY Times on Digital Humanities" href="http://www.nytimes.com/2010/11/17/arts/17digital.html?_r=1&amp;pagewanted=all" target="_blank">New York Times</a> link on the Open Knowledge Foundation discussion list about Digital Humanities and its growth. It mentions the <a title="Mapping the Republic of Letters" href="http://republicofletters.stanford.edu/" target="_blank">Mapping the Republic of Letters</a> project (unfortunately it does not appear to be open) and its linking together of the centres of letter production.</p>
<p>Last night I managed to build the places index, the locations from where Charles Dickens wrote his letters, parsing the RDF with rdflib 3 (I&#8217;ve been trying to ensure that it is compatible with version 2.4.2 on Linux, though that is to come shortly). It still needs tidying up but I&#8217;m going to post what I have with the next draft of the site. Dickens&#8217;s letters cover France, the US and the UK and gradually I&#8217;m getting the latitudes and longitudes for the locations so that they can be used with <a title="OpenLayers mapping site" href="http://www.openlayers.org" target="_blank">Open Layers</a> maps.</p>
<p>Does the project necessarily break new ground? Possibly, possibly not. I&#8217;m not a Dickens scholar (though I do think he could turn a phrase or two&#8230;). I hope that the project will allow us to think about ways of linking and sharing the data as well as find different ways of mining it. Mapping seems like one query. Placing the letters into context using historical data would also be useful (but I need to find the right data sources &#8211; not just Wikipedia).</p>
<p>In all, I firmly believe that it is important to create and experiment with the data available and to think of new queries or mash-ups. What we create now will probably be placed into the shadows by the next generation, but why wait? Let&#8217;s have fun now.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/11/digital-humanities-and-building-data-sets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

