<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Aust Gate &#187; Information Retrieval</title>
	<atom:link href="http://austgate.co.uk/category/informationretrieval/feed/" rel="self" type="application/rss+xml" />
	<link>http://austgate.co.uk</link>
	<description>Open Knowledge and Literature</description>
	<lastBuildDate>Tue, 08 May 2012 20:33:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Thinking about texts and communities at Textcamp</title>
		<link>http://austgate.co.uk/2011/08/thinking-about-texts-and-communities-at-textcamp/</link>
		<comments>http://austgate.co.uk/2011/08/thinking-about-texts-and-communities-at-textcamp/#comments</comments>
		<pubDate>Sun, 14 Aug 2011 12:33:01 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[open_literature]]></category>
		<category><![CDATA[textcamp]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=378</guid>
		<description><![CDATA[Having gone to Textcamp yesterday, I started playing with Wordle and IBM&#8217;s Many Eyes at the suggestion of Dave Flanders of the JISC. As James Harriman-Smith, the organiser and Open Literature co-ordinator for the Open Knowledge Foundation, had suggested that this year is the anniversary of the manuscript of Alexander Pope&#8217;s An Essay on Criticism, [...]]]></description>
			<content:encoded><![CDATA[<p>Having gone to <a title="Textcamp on Open Literature" href="http://wiki.openliterature.net/Text_Camp_2011" target="_blank">Textcamp</a> yesterday, I started playing with Wordle and IBM&#8217;s Many Eyes at the suggestion of <a title="David Flanders JISC staff page" href="http://www.jisc.ac.uk/contactus/staff/davidfflanders" target="_blank">Dave Flanders</a> of the <a title="JISC website" href="http://www.jisc.ac.uk/" target="_blank">JISC</a>. As <a title="James Harriman-Smith's OKF page" href="http://okfn.org/members/jameshs/" target="_blank">James Harriman-Smith</a>, the organiser and Open Literature co-ordinator for the Open Knowledge Foundation, had suggested that this year is the anniversary of the manuscript of <a title="Wikipedia on Alexander Pope" href="http://en.wikipedia.org/wiki/Alexander_Pope" target="_blank">Alexander Pope</a>&#8217;s <a title="Wikipedia on Essay on Criticism" href="http://en.wikipedia.org/wiki/An_Essay_on_Criticism" target="_blank">An Essay on Criticism</a>, I popped the Gutenberg text into Wordle to see what it <a title="Wordle on Pope's Essay in Criticsm" href="http://www.wordle.net/show/wrdl/3912697/Essay_in_Criticism" target="_blank">shows as a tag cloud</a>. <a title="Wordle: Essay in Criticism" href="http://www.wordle.net/show/wrdl/3912697/Essay_in_Criticism"><img style="padding: 4px; border: 1px solid #ddd;" src="http://www.wordle.net/thumb/wrdl/3912697/Essay_in_Criticism" alt="Wordle: Essay in Criticism" align="left" /></a> The dominance of &#8216;wit&#8217; is not a surprise, as wit in poetry was a prized quality for Pope and Dryden. There are some small issues, such as &#8216;still&#8217; and &#8216;Still&#8217; being counted separately. Perhaps this could be rectified by making everything lower case, but that presents other issues if two words are similar but the capitalisation suggests a different intonation. 
As I&#8217;ve <a title="Post on Word clouds" href="http://austgate.co.uk/2010/10/tagging-the-revolution-exploring-edmund-burkes-reflections-on-the-revolution-in-france/" target="_blank">blogged before</a>, word clouds are great but not if they don&#8217;t link to anything, so at some point in the future I&#8217;ll sit down and actually upload a table to create a useful tag cloud. John Levin, of <a title="James Levin's blog onAnterotesis on Ecco" href="http://anterotesis.com/wordpress/2011/08/making-the-tcp-ecco-texts-accessible/" target="_blank">Anterotesis</a>, loaded a CSV file of the recently released ECCO files. He loaded Volume Four of Defoe&#8217;s Tour of the Whole Island of Great Britain, which features Scotland.</p>
<div id="attachment_383" class="wp-caption alignleft" style="width: 190px"><a href="http://austgate.co.uk/wp-content/uploads/2011/08/oenvq.jpg"><img class="size-medium wp-image-383" title="Wordcloud of Defoe's journey" src="http://austgate.co.uk/wp-content/uploads/2011/08/oenvq-180x300.jpg" alt="Wordcloud of Defoe's journey taken at Textcamp by Dave Flanders" width="180" height="300" /></a><p class="wp-caption-text">Wordcloud of Defoe&#39;s journey taken at Textcamp</p></div>
<p>Using the Many Eyes Word Cloud, we can see that Scotland is unsurprisingly the largest item, but Lord and Earl are also popular, suggesting that he stopped with or met the aristocracy rather than just travelling randomly. Dave Flanders and John created some cool visualisations using the tool, which allow you to follow words in the text and to see, in a tree fashion, which words are most often linked to them (using bigrams, I would suppose). It is certainly something that I will be looking up later for &#8220;quick win&#8221; visualisations.</p>
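<p>As a rough sketch of what sits behind these visualisations, the snippet below counts words (with and without case folding) and records which words directly follow each word, which is the bigram idea guessed at above. It is not how Wordle or Many Eyes actually work; the function names and the sample sentence are made up for illustration.</p>

```python
import re
from collections import Counter, defaultdict


def tokenise(text, fold_case=True):
    """Split text into words, optionally folding everything to lower case."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w.lower() for w in words] if fold_case else words


def word_counts(text, fold_case=True):
    """Count how often each word appears, as a tag cloud would."""
    return Counter(tokenise(text, fold_case))


def following_words(text, fold_case=True):
    """Map each word to a Counter of the words that directly follow it (bigrams)."""
    tokens = tokenise(text, fold_case)
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers


# Invented sample sentence, not taken from Pope or Defoe.
sample = "Still the wit of the Lord is still the wit of the Earl"
print(word_counts(sample, fold_case=False).most_common(3))
print(word_counts(sample).most_common(3))
print(following_words(sample)["the"].most_common(3))
```

<p>Keeping <code>fold_case</code> as a switch leaves room for the cases where capitalisation carries meaning, so the same text can be counted both ways and compared.</p>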
<p>One of the intriguing projects that was suggested was building our own DIY bookscanner using links currently stored on the <a title="DIY Bookscanner" href="http://wiki.openliterature.net/Tcamp11/DIYD" target="_blank">Textcamp 2011 wiki pages</a>. I think that Dave Flanders might be organising a hack weekend to actually build the machine for real use. I find it interesting, and I think that it would be cool to see whether one can also be built at home or using iPhone / Android devices, which would entail a software hack as well unless an app already exists. That is something to explore later.</p>
<p>Mark MacGillivray, of OKFN and <a title="Cottage Labs" href="http://cottagelabs.com/" target="_blank">Cottage Labs</a>, and Brian Hole of <a title="Ubiquity Press" href="http://www.ubiquitypress.com/" target="_blank">Ubiquity Press</a>, spoke about Open Access and making scholarship open while retaining its rigour. Using Open Access, we should be able to share the data, the ways of interpreting it and the final interpretation which is published.</p>
<p>The science community has been doing this for some while and things like the Panton Principles and Science Commons are showing the way. One of the ideas was to write a handbook for how to use openness in literature, which is something that we need to address and build on. We ought to write an open guide / manual and build on / develop the Panton Principles where necessary as a core set of principles to work with.</p>
<p>Days like Textcamp and Book Hackday are extremely useful for thinking about this and working on the ideas. It is easy to get into echo chambers of mailing lists and blogs; we need these events to meet new people, be challenged to explain ourselves and to either build on the day or go away with ideas to test and try out. The day has got me excited about using word clouds again and doing a bit more work on them to make them a useful tool. It has also got me excited about book scanning and doing some hardware hacking (which I&#8217;ve not really done before).</p>
<p>Running the Pope essay through Wordle makes me excited about testing what we can do with the ECCO TEI documents that John Levin links to. Can we hyperlink to other texts, authors and events that are mentioned in it (not just with the annotator tool but in generated HTML) or use HTML 5 to embed audio links to further discussions or pronunciation (for example Byron&#8217;s Don Juan, which has been argued to be pronounced &#8220;Jew-an&#8221; rather than &#8220;Hwan&#8221;, and the arguments for and against)?</p>
<p>Perhaps that gets to one of the issues that arose in the break-out discussions in the kitchen. After the lightning talk about digital publishing, there seemed to be an argument about whether current digital publishing was really pushing the boundaries or flailing around. I do think that it has some real benefits for niche publishing but these have not been fully explored. The model will need to change and perhaps become more open in those senses, perhaps linking the raw data to the interpretation earlier to allow the relevant community to peer review the data earlier. Just a suggestion. There are two distinct communities: the top-down business layer and the grass roots layer of activists, data developers and so on. Both would appear to have broadly similar aims, but the question is how to put them together in a useful way so that both can learn. Don&#8217;t get me wrong here, as I believe I&#8217;m at the grass roots layer, but I think that both sides could have a dialogue which gets around the issues that the music and film industries have found themselves in, i.e. confrontation. We are here to disrupt and make because we are passionate. We care about the industry. Publishing is an industry which needs to change and transform itself. Put the two together and there are ways of moving forward. My hope is that in future events, we could get some more publishers along to the event.</p>
<p>The other important thing is that these conversations carry on afterwards. The round table discussions were great, as were the break-out ones in the kitchen, but they need to carry on or we create our own echo chamber, which reduces the value of what happened yesterday.</p>
<p>Whilst I did not do as much coding as I wanted to yesterday, I met some new people and caught up with colleagues. The fact that organisations such as JISC are supporting events like this shows their underlying importance and use to the community. We&#8217;ve started, now we need to carry on by chatting, blogging, sharing and doing more of these events.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/08/thinking-about-texts-and-communities-at-textcamp/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Documents and data</title>
		<link>http://austgate.co.uk/2011/07/weeknotes-documents-and-data/</link>
		<comments>http://austgate.co.uk/2011/07/weeknotes-documents-and-data/#comments</comments>
		<pubDate>Sun, 03 Jul 2011 14:39:22 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[documents]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[linked_data]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=364</guid>
		<description><![CDATA[The main project this week (apart from the ongoing one of moving and virtualising servers) is to begin work on our technical documents. I&#8217;m trying to move them onto the web and make them useful, not only in terms of reading about them but also to make them linkable. I&#8217;m trying to get them out [...]]]></description>
			<content:encoded><![CDATA[<p>The main project this week (apart from the ongoing one of moving and virtualising servers) is to begin work on our technical documents.</p>
<p>I&#8217;m trying to move them onto the web and make them useful, not only in terms of reading about them but also to make them linkable. I&#8217;m trying to get them out of being placed on a web site as Word or PDF downloads and move them into being web pages with comments. Drupal 7&#8217;s inbuilt book module is probably the way to go and is producing some really nice results in the hacking I managed on Friday. There is a certain pleasure now in that I began the hack at 8:30 and within an hour, I had a working document (albeit I wanted to mess around with the URLs to make them nicer and far more meaningful). It had comments and was generally felt to be good.</p>
<p>The next task was to work on a way of doing Frequently Asked Questions (FAQs). Having begun some of the work using the <a title="Frequently Asked Questions Drupal module" href="http://drupal.org/project/faq" target="_blank">Frequently Asked Questions module</a>, I decided it had too many issues for us (including not being able to control where the page was, and it did not appear to play nicely with the <a title="Pathauto Drupal module" href="http://drupal.org/project/pathauto" target="_blank">Pathauto rewriting module</a>), so I wrote my own content type, which we can manipulate via the Views module to create sets of FAQs. When I&#8217;ve got more time, I may come back to the module and try to help fix some bugs.</p>
<p>Whilst neither of these is a finished item, it was a pleasant day hacking and creating, getting prototypes ready in a day. I&#8217;m taking this as a sign of increasing familiarity with Drupal. I do, however, need to find a morning to finish the Sugar SOAP integration module and tidy that up. Ideally I&#8217;d try to find a way of integrating it with the current module to offer swappable backends.</p>
<p>I&#8217;ve also started looking at using <a title="Redis website" href="http://redis.io" target="_blank">Redis</a> for caching again in a major way to ensure that various static fields of data, such as UK counties, can have a common reference to reduce data cleaning issues such as county being written as co., co and county. I&#8217;m also looking at the issue of Linked Data and how to integrate the ideas into our current projects. For now I&#8217;m rereading <a title="Tim Berner's-Lee on Linked Data" href="http://www.w3.org/DesignIssues/LinkedData.html" target="_blank">Tim Berners-Lee&#8217;s guide</a>, linked from the <a title="Linked Data website" href="http://linkeddata.org/" target="_blank">linkeddata.org</a> website, and formulating ideas and refining the ones I currently have.</p>
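<p>To make the county example concrete, here is a minimal sketch of the normalisation step on its own. It uses a plain dictionary as the common reference; in the setup described above that table could instead be held in Redis so that every application shares it. The table contents and the function name are assumptions for illustration only.</p>

```python
import re

# Markers that people write in place of, or alongside, "county".
COUNTY_MARKERS = ("county", "co")

# Hypothetical reference table: lookup key to canonical county name.
KNOWN_COUNTIES = {
    "durham": "County Durham",
    "oxfordshire": "Oxfordshire",
}


def normalise_county(raw, known_counties=KNOWN_COUNTIES):
    """Return the canonical county name for a loosely written value, or None."""
    # Lower-case the value and strip punctuation such as the dot in "Co.".
    cleaned = re.sub(r"[^a-z ]", "", raw.lower()).split()
    # Drop any "county" / "co" marker so that all spellings share one key.
    key = " ".join(word for word in cleaned if word not in COUNTY_MARKERS)
    return known_counties.get(key)


for value in ("Co. Durham", "co Durham", "County Durham", "Oxfordshire"):
    print(value, "becomes", normalise_county(value))
```

<p>Returning <code>None</code> for anything not in the table keeps unknown values visible for manual cleaning rather than silently guessing.</p>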
<p>Ambition might get the better of me but at least I feel like I want to take all of this on and to try to improve skills and learn more. In the meanwhile, I have some serious hills to climb.</p>
<p>Update:  This post has got me rethinking the Open Correspondence RDF and Linked Data. The more I delve, the greater my sense of needing to rethink that part of the project and to complete the correspondence links. Most of them are there but need complete linking. I also need to look at Python&#8217;s <a title="Python's RDFLib code" href="http://www.rdflib.net/" target="_blank">RDFLib</a> and perhaps make better use of the SPARQL queries and stores. I sense an evening or several of experimentation before a hacking weekend to resolve these issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/07/weeknotes-documents-and-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Research Databases in the Humanities</title>
		<link>http://austgate.co.uk/2011/01/research-databases-in-the-humanities/</link>
		<comments>http://austgate.co.uk/2011/01/research-databases-in-the-humanities/#comments</comments>
		<pubDate>Sun, 23 Jan 2011 12:11:30 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[digital_humanities]]></category>
		<category><![CDATA[rdf]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=281</guid>
		<description><![CDATA[I went to the Research Databases in the Humanities workshop, organised by Sudamih, which was an excellent afternoon and time well spent. An Oxford heavy event, there were a number of interesting directions that came out of the afternoon. Firstly James Wilson, project manager of Sudamih at Oxford University Computing Services, outlined the Database as [...]]]></description>
			<content:encoded><![CDATA[<p>I went to the <a title="Research Databases in the Humanities worskhop site" href="http://sudamih.oucs.ox.ac.uk/databases_workshop.xml" target="_blank">Research Databases in the Humanities workshop</a>, organised by Sudamih, which was an excellent afternoon and time well spent. An Oxford heavy event, there were a number of interesting directions that came out of the afternoon.</p>
<p>Firstly James Wilson, project manager of <a title="Sudamih website" href="http://sudamih.oucs.ox.ac.uk/" target="_blank">Sudamih</a> at Oxford University Computing Services, outlined the Database as a Service (DaaS) project, which I think describes a desperately needed service. The project seeks to allow researchers to upload their datasets (I believe from SQL and CSV) into a MySQL / PostgreSQL infrastructure with a common front end, though with access control levels on the data itself. The idea is to keep data sets available for long term use.</p>
<p>The second important point was that data sets need to be kept available if researchers move on, and kept open for sharing, since the same data can be used across the field or even by different disciplines or funding ends. Resources, as <a title="Claire Warwick's page at UCL" href="http://www.ucl.ac.uk/infostudies/claire-warwick/" target="_blank">Claire Warwick</a> of UCL noted, need to be kept available for the long term, partially in response to promises to funding bodies but also for citation purposes and re-use by future scholars. There are sites which now appear moribund but could be kept useful if the data could be moved somewhere or the project kept in a service such as the above DaaS concept. Of course sites do need funding to stay alive and the notions of sustainable business models (from free to pay for access) were skirted over.</p>
<p>(I do wonder if it is practical to offer / build something like this as an <a title="Open Knowledge Foundation site" href="http://okfn.org" target="_blank">Open Knowledge Foundation</a> project as an adjunct for CKAN for smaller projects. But perhaps that is another post for another day&#8230;)</p>
<p>Jacob Dahl, of the <a title="CDLI website" href="http://cdli.ucla.edu/" target="_blank">Cuneiform Digital Library Initiative</a>, was one of the only speakers who touched on the openness issue. He commented that there is a site which draws from his open database and makes some amendments but these are then not offered back to the originating site or users in an open fashion. Again this leads to a &#8220;silo&#8221; mentality which prevents knowledge being shared and developed. This is a more insidious threat to developing datasets and databases since the knowledge cannot be easily shared. The scary thing about this is that rather than the websites being made moribund, the data itself is, and a community cannot develop around it to refresh and maintain the data. Perhaps this is a more long term threat to Digital Humanities than tired-looking websites.</p>
<p>On a tangent, one of the speakers mentioned Alastair Dunning&#8217;s <a title="Alasitair Dunning on digitisation and its needs" href="http://digitisation.jiscinvolve.org/wp/2011/01/21/does-the-digital-humanities-need-more-digitisation/" target="_blank">blog post on digitisation</a>. I&#8217;m not going to summarise it as it is fairly short, but the outcome that I take away is that digitisation is necessary but it needs to allow users to create new queries. This cannot happen unless the dataset is maintained and access is given through APIs or search. (Funny how search comes back again. I&#8217;m sure it is haunting me.)</p>
<p>The afternoon was rounded off with a talk about the <a title="CLAROS project homepage" href="http://www.clarosnet.org/about/default.htm" target="_blank">CLAROS project</a>, which is using Semantic Web technologies to query several major databases of Classical Art across the world. It is something that I&#8217;m interested in (with the endpoints on <a title="Open Correspondence website" href="http://www.opencorrespondence.org" target="_blank">OpenCorrespondence</a> but I&#8217;m not quite there yet) and it marks the future for projects, but I do wonder if the basics, the technological infrastructure for researchers, need to exist first. It comes back to the chicken and egg though. If the possibilities are not given and developed in prototype or early working models, then they remain only possibilities and not useful.</p>
<p>I believe that a number of outcomes and debates arose, which I&#8217;ve outlined above rather than talking about the individual talks. We come back to the notions of openness and preservation. I think that there are quite a few things that could be developed to aid researchers and also issues to keep in mind for developing future resources.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/01/research-databases-in-the-humanities/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Searching Open Correspondence with Xapian</title>
		<link>http://austgate.co.uk/2011/01/searching-open-correspondence-with-xapian/</link>
		<comments>http://austgate.co.uk/2011/01/searching-open-correspondence-with-xapian/#comments</comments>
		<pubDate>Sun, 09 Jan 2011 15:01:26 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[xapian]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=270</guid>
		<description><![CDATA[As part of the continuing work on Open Correspondence, I managed to install Xapian to act as a full text search engine. I&#8217;ve been looking to do this for a while and had started on working on a remote back end (as blogged here) but decided not to use it as it appears to have [...]]]></description>
			<content:encoded><![CDATA[<p>As part of the continuing work on <a title="The Open Correspondence website" href="http://www.opencorrespondence.org" target="_blank">Open Correspondence</a>, I managed to install <a title="Xapian project" href="http://xapian.org/" target="_blank">Xapian</a> to act as a full text search engine. I&#8217;ve been looking to do <a title="Austgate post on Xapian" href="http://austgate.co.uk/2010/10/installing-xapian-into-open-correspondence-and-next-steps/" target="_blank">this for a while</a> and had started working on a remote back end (<a title="Xapian remote back end" href="http://austgate.co.uk/2010/11/weeknotes-open-correspondence-xapian-and-linked-data/" target="_blank">as blogged here</a>) but decided not to use it as it appears to have a lack of security if being used on different machines across the web. I suppose you could place it behind a web service and expose it that way if you want to create a secure remote back end.</p>
<p>The search is rather basic, a simple form to enter a phrase or words, and the results show the text and the letter URL. On the list of things to do is to create an advanced form to allow the user to filter the results down further by date or to find relevance in the text.</p>
<p>From what I can see, there are things that I can do on top of the simple search to achieve this. It would be useful to be able to cut the selection down by date, which could be parsed from the text, with anything outside the chosen dates discarded. Another is making the searches less naive and trying to discover relevance in the results: perhaps there is somebody called Nickleby in the letters who is not part of the novel, &#8216;Nicholas Nickleby&#8217;.</p>
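<p>The date filtering idea could be sketched roughly as follows. This does not use the Xapian API at all; it assumes the search has already returned results as dictionaries with <code>text</code> and <code>url</code> keys, and that dates in the letters look like &#8220;9 January 1841&#8221;. Both are assumptions, and real letters would need more patterns.</p>

```python
import re
from datetime import date, datetime

# Assumed date style, e.g. "9 January 1841"; other styles are not handled.
DATE_PATTERN = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{4})\b")


def first_date(text):
    """Return the first day-month-year date found in the text, or None."""
    for match in DATE_PATTERN.finditer(text):
        try:
            return datetime.strptime(" ".join(match.groups()), "%d %B %Y").date()
        except ValueError:
            continue  # matched the shape of a date but is not a real one
    return None


def filter_by_date(results, start, end):
    """Keep only results whose text carries a date between start and end."""
    kept = []
    for result in results:
        found = first_date(result["text"])
        if found is not None and start <= found <= end:
            kept.append(result)
    return kept


# Invented results standing in for what a search might return.
results = [
    {"url": "/letter/1", "text": "Devonshire Terrace, 9 January 1841. My dear Forster"},
    {"url": "/letter/2", "text": "A letter with no date in it"},
    {"url": "/letter/3", "text": "Broadstairs, 3 March 1850."},
]
print(filter_by_date(results, date(1840, 1, 1), date(1845, 12, 31)))
```

<p>Results without a recognisable date are discarded here, which matches the idea above but may be too strict in practice.</p>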
<p>Simply put there is a fair amount of data munging that needs to go on next. That&#8217;s fine.</p>
<p>The next step that I&#8217;m working on is the use of <a title="Python OFS website" href="http://pypi.python.org/pypi/ofs/0.1" target="_blank">OFS</a> to run some of the endpoints and XML streams that are used for internal purposes, such as locations or the RDF endpoint. I&#8217;m hoping to use it to bring through the Linked Data into the letters themselves. I&#8217;m looking at using these mainly for performance reasons. Along with a hack on the places that I&#8217;m hoping to do next week, the main body of Open Correspondence will be done.</p>
<p>Next up is better data munging and information extraction, such as rewriting the parser and adding more letters into the database. Essentially I&#8217;d like to provide better data in accessible formats for the letters and to perhaps offer some tools to kickstart development.</p>
<p>I&#8217;m going to the <a title="Research Databases in Humanities workshop" href="http://sudamih.oucs.ox.ac.uk/databases_workshop.xml" target="_blank">Research Databases in the Humanities</a> workshop to see what else we can do with the data and the site.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/01/searching-open-correspondence-with-xapian/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finding the data signal in the noise</title>
		<link>http://austgate.co.uk/2010/12/finding-the-data-signal-in-the-noise/</link>
		<comments>http://austgate.co.uk/2010/12/finding-the-data-signal-in-the-noise/#comments</comments>
		<pubDate>Thu, 30 Dec 2010 09:40:33 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[data mining]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=266</guid>
		<description><![CDATA[Marshall Kirkpatrick, on ReadWriteWeb, poses the question A web of infinite information: does that sound like a scary problem of &#8220;just too much&#8221;? in a &#8220;Mamas, Don&#8217;t Let Your Babies Grow Up to Be Data Wranglers&#8221; where he discusses an interview with Evan Williams on GigaOm. (I&#8217;m not going to discuss the interview here (but [...]]]></description>
			<content:encoded><![CDATA[<p>Marshall Kirkpatrick, on ReadWriteWeb, poses the question</p>
<blockquote><p>A web of <em>infinite information</em>: does that sound like a <em>scary problem</em> of  &#8220;just too much&#8221;?</p></blockquote>
<p>in &#8220;<a title="Marshall Kirkpatrick on Datawranglers on ReadWriteWeb" href="http://www.readwriteweb.com/archives/mamas_dont_let_your_babies_grow_up_to_be_data_wran.php" target="_blank">Mamas, Don&#8217;t Let Your Babies Grow Up to Be Data Wranglers</a>&#8221;, where he discusses an interview with <a title="Evan Williams interview on GigaOm blog" href="http://gigaom.com/2010/12/29/evan-williams-on-web-of-infinite-information/" target="_blank">Evan Williams on GigaOm</a>. (I&#8217;m not going to discuss the interview here, but it is an interesting read.)</p>
<p>(I&#8217;m not sure that I can agree with the idea that the decentralised web is dead, but the links between sites and services are becoming increasingly visible, sometimes deliberately so and sometimes because they are being used as a service. However, I digress&#8230;)</p>
<p>In response to Om Malik&#8217;s question, &#8220;<em>You feel there is just too much stuff on the web these days?</em>&#8221;, Williams responds:</p>
<blockquote><p>There’s too much stuff. It seems to me that almost all tools we rely on  to manage information weren’t designed for a world of infinite info.  They were designed as if you could consume whatever was out there that  you were interested in.</p></blockquote>
<p>before identifying Twitter as a response to this and finding the signal in the noise. This response still requires the user to identify the signal that they are interested in and to follow it. Machine algorithms may not necessarily identify what a user is precisely interested in but, from what I understand, they are getting better.</p>
<p>However, we come back to the question that Tim Davies posed in the <a title="Tim Davies on the open data hack day" href="http://www.timdavies.org.uk/2010/12/05/reflections-on-oxford-open-data-day/" target="_blank">aftermath of the Open Data hackday</a>. He identifies two approaches to Open Data: data-led and problem-led. The data-led approach finds &#8220;some data of interest, and then explored what could be done with it&#8221; and the problem-led one is to start &#8220;with an issue to explore and then seeking data to work with&#8221;. Of course both have their issues (data-led can lose focus and problem-led can struggle to identify the relevant data).</p>
<p>This encapsulates the issue that Evan Williams identifies with the infinite amount of data on the web and how to make it useful. The approach still seems to be very much a data-led one rather than a problem-led one (though this is not to say that these approaches do not exist for apps or sites). The tools to usefully mine the existing data are only just being developed and it recalls something that I heard a while ago. The tools that we have now solve existing or previous problems but not tomorrow&#8217;s ones. They are being developed by interested parties but somehow we need to get the data developers talking to the problem solvers to really make the existing data sets useful.</p>
<p>In the blurb to Philipp Janert&#8217;s <a href="http://www.amazon.co.uk/gp/product/0596802358?ie=UTF8&amp;tag=throughtheloo-21&amp;linkCode=as2&amp;camp=1634&amp;creative=6738&amp;creativeASIN=0596802358">Data Analysis with Open Source Tools</a><img style="border: none !important; margin: 0px !important;" src="http://www.assoc-amazon.co.uk/e/ir?t=throughtheloo-21&amp;l=as2&amp;o=2&amp;a=0596802358" border="0" alt="" width="1" height="1" />, there is an appropriate line:</p>
<blockquote><p>purpose is more important than process</p></blockquote>
<p>So Marshall&#8217;s question might well be turned slightly around. Rather than looking at the notion of there being too much data, it&#8217;s one of how to identify the purpose to mine the noise for the signal. There is only too much information if you do not have a purpose in looking at it.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/12/finding-the-data-signal-in-the-noise/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hacking Arts Council data</title>
		<link>http://austgate.co.uk/2010/12/hacking-arts-council-data/</link>
		<comments>http://austgate.co.uk/2010/12/hacking-arts-council-data/#comments</comments>
		<pubDate>Sun, 05 Dec 2010 12:31:24 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[arts_council]]></category>
		<category><![CDATA[open_data]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=253</guid>
		<description><![CDATA[I lost my hackday cherry yesterday and went to the Open Data hackathon to look at the South East arts council data found at the data.gov.uk site (http://data.gov.uk/dataset/grants-for-the-arts-awards-arts-council-england). Our hosts, White October, were fantastic and welcoming (and put the kettle on as soon as I came in!) and Incuna provided the much needed pizzas for [...]]]></description>
			<content:encoded><![CDATA[<p>I lost my hackday cherry yesterday and went to the Open Data hackathon to look at the South East arts council data found at the data.gov.uk site (<a title="http://data.gov.uk/dataset/grants-for-the-arts-awards-arts-council-england/" rel="nofollow" href="http://data.gov.uk/dataset/grants-for-the-arts-awards-arts-council-england" target="_blank">http://data.gov.uk/dataset/grants-for-the-arts-awards-arts-council-england).</a> Our hosts, <a title="White October site" href="http://www.whiteoctober.co.uk/" target="_blank">White October</a>, were fantastic and welcoming  (and put the kettle on as soon as I came in!) and <a title="Incuna website" href="http://incuna.com/" target="_blank">Incuna</a> provided the much needed pizzas for lunch.</p>
<p>Al Power came up with the cash visualisation here: <a title="http://www.d1080072.cp.blacknight.com/hackday/" rel="nofollow" href="http://www.d1080072.cp.blacknight.com/hackday/" target="_blank">http://www.d1080072.cp.blacknight.com/hackday/</a> and Andy Cotgreave came up with the cool Tableau visualisation (<a title="Tableau visualisation of arts council data" href="http://public.tableausoftware.com/views/Artsfunding/Fundingamountinteractive?:embed=yes&amp;:tabs=yes&amp;:toolbar=yes" target="_blank">http://public.tableausoftware.com/views/Artsfunding/Fundingamountinteractive?:embed=yes&amp;:tabs=yes&amp;:toolbar=yes</a>) using all five years of available data on the site.</p>
<p>I played with <a title="tag cloud of arts data" href="http://austgate.co.uk/development/funding_tag.php" target="_blank">tag clouds</a> as a way of diving into the data itself, which links to further and more detailed data (which needs some serious work), and also used Simile to create a <a title="Simile timeline of Arts Council funding data" href="http://austgate.co.uk/development/funding_simile.html" target="_blank">timeline</a>, but I&#8217;m not 100% sure that this is really successful.</p>
<p>Since the hack, I&#8217;ve added the complete data set that Andy Cotgreave put together to the database. I&#8217;m now trying to build a front end to it and also to create the dataset in XML and JSON to add it to <a title="CKAN archive site" href="http://www.ckan.net" target="_blank">CKAN.net</a>. The dataset is probably the easiest task and the front end can be developed in time.</p>
<p>I&#8217;d like to add the <a title="Heritage Lottery Fund site" href="http://www.hlf.org.uk/Pages/Home.aspx" target="_blank">Heritage Lottery Fund</a> data where it applies to the arts, or data from the <a title="Open Charities website" href="http://opencharities.org" target="_blank">opencharities.org</a> website, to try to build a better picture of the funding and the projects that are happening.</p>
<p>It might take some time&#8230;</p>
<p>Update: <a title="Tim davies on Oxford odhd hackathon" href="http://www.timdavies.org.uk/2010/12/05/reflections-on-oxford-open-data-day/" target="_blank">Tim Davies</a> has got a thought-provoking post on the day.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/12/hacking-arts-council-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Open Correspondence, Xapian and Linked Data</title>
		<link>http://austgate.co.uk/2010/11/weeknotes-open-correspondence-xapian-and-linked-data/</link>
		<comments>http://austgate.co.uk/2010/11/weeknotes-open-correspondence-xapian-and-linked-data/#comments</comments>
		<pubDate>Sun, 07 Nov 2010 10:58:20 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[charles dickens]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[xapian]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=233</guid>
		<description><![CDATA[After last week&#8217;s server move, we discovered one or two things that needed to be changed before they could go live. The main thing was the Xapian search which I had been working on. The initial version kept the Xapian server on the local machine and used that to index and search the letters but [...]]]></description>
			<content:encoded><![CDATA[<p>After last week&#8217;s server move, we discovered one or two things that needed to be changed before they could go live. The main thing was the Xapian search which I had been working on. The initial version kept the Xapian server on the local machine and used that to index and search the letters, but the new version is distributed across machines so it required a brief change.</p>
<p>On a &#8220;one box wonder&#8221;, a Xapian index is opened for writing in Python via:</p>
<blockquote><p>xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)</p></blockquote>
<p>where db_path is the path of the database that you want to hold the index, and the index is opened for searching using:</p>
<blockquote><p>xapian.Database(db_path)</p></blockquote>
<p>Since the project uses Pylons, the controller reads the database path from the .ini file loaded at runtime to link to the correct database.</p>
<p>Using the documentation on the <a title="Xapian Documentation on remote backends" href="http://xapian.org/docs/remote.html" target="_blank">Xapian site for remote backends</a> and the <a title="Xapian Python bindings documentation" href="http://xapian.org/docs/bindings/python/" target="_blank">Python bindings</a>, I was able to quickly adjust the code so that xapian.WritableDatabase is replaced by:</p>
<blockquote><p>xapian.remote_open_writable(&quot;&lt;host name&gt;&quot;, &lt;port number&gt;)</p></blockquote>
<p>and is opened for searching by the following, with the port number passed as an integer rather than a string in both calls:</p>
<blockquote><p>xapian.remote_open(&quot;&lt;host name&gt;&quot;, &lt;port number&gt;)</p></blockquote>
<p>Once that is set up, all you need to do is start the TCP server, which is what I&#8217;ve been looking at. I downloaded the tar.gz file of xapian-core from the Xapian site, configured and built it on Ubuntu Lucid Lynx and then ran xapian-tcpsrv --port &lt;port number&gt; &lt;database name&gt; in a new terminal window, which allowed me to test the connections and get them ready for going live.</p>
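<p>To make the switch between the one-box and the distributed setup concrete, here is a minimal sketch of how the choice might be driven from a single configuration value. The tcp://host:port setting format and the helper names are assumptions made for illustration, not the project&#8217;s actual code; only the xapian calls are the ones discussed above.</p>

```python
def parse_backend(setting):
    """Split a hypothetical setting into a backend description.

    "tcp://host:port" means a remote xapian-tcpsrv; anything else is
    treated as the path of a local database.
    """
    prefix = "tcp://"
    if setting.startswith(prefix):
        host, _, port = setting[len(prefix):].rpartition(":")
        return ("remote", host, int(port))
    return ("local", setting)


def open_for_search(setting):
    """Open a read-only database, local or remote."""
    import xapian  # imported lazily so the parsing works without the bindings

    backend = parse_backend(setting)
    if backend[0] == "remote":
        # the port is handed over as an integer
        return xapian.remote_open(backend[1], backend[2])
    return xapian.Database(backend[1])


def open_for_indexing(setting):
    """Open a writable database, local or remote."""
    import xapian

    backend = parse_backend(setting)
    if backend[0] == "remote":
        return xapian.remote_open_writable(backend[1], backend[2])
    return xapian.WritableDatabase(backend[1], xapian.DB_CREATE_OR_OPEN)
```

<p>With something like this, moving from the local index to the TCP server would only mean changing the value in the .ini file rather than the controller code.</p>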
<p>Changes are afoot on the Open Correspondence site as well. As part of a conversation that involved Keith Alexander, of <a title="Talis Platform" href="http://www.talis.com/platform" target="_blank">Talis</a>, the project is going to move in a more Linked Data direction with references to the books, magazines, correspondents and so on. I&#8217;d already started going in this direction with the correspondent links (such as <a title="Georgina Hogarth correspondent link on Open Correspondence" href="http://www.opencorrespondence.org/letters/correspondent/Miss%20Hogarth" target="_blank">http://www.opencorrespondence.org/letters/correspondent/Miss%20Hogarth</a>) so this is really an extension of where we need to go to connect to other resources such as DBpedia, Wikipedia and so on. The fact that it is <a title="Dickens 2012 website" href="http://www.dickens2012.org" target="_blank">Dickens&#8217;s bi-centenary in 2012</a> gives an added boost to the project. The Linked Data approach gives us the chance of creating some sort of framework for future expansion and linking together of data sources, not only at a literary level but also socially. It also encourages me to sort out the content negotiation work that was started and to try to follow the FAQs that the <a title="Pedantic Web group site" href="http://pedantic-web.org/" target="_blank">Pedantic Web</a> group have posted, to make sure that the site follows the best standards that it can and to build them into future developments and directions.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/11/weeknotes-open-correspondence-xapian-and-linked-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tweeting changes with Node.js</title>
		<link>http://austgate.co.uk/2010/11/tweeting-changes-with-node-js/</link>
		<comments>http://austgate.co.uk/2010/11/tweeting-changes-with-node-js/#comments</comments>
		<pubDate>Wed, 03 Nov 2010 21:01:57 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[node.js]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=230</guid>
		<description><![CDATA[As a break from Open Correspondence, I&#8217;ve been looking at node.js, the server-side JavaScript platform. I&#8217;ve been thinking about the document stuff that I&#8217;ve been working on with Milton. One of the things that I had mooted as an idea was reading tweets from Twitter and pushing them back to the document. I&#8217;ve been playing with [...]]]></description>
			<content:encoded><![CDATA[<p>As a break from Open Correspondence, I&#8217;ve been looking at <a title="Node.js home site" href="http://nodejs.org/" target="_blank">node.js</a>, the server-side JavaScript platform. I&#8217;ve been thinking about the document stuff that I&#8217;ve been working on with Milton. One of the things that I had mooted as an idea was reading tweets from Twitter and pushing them back to the document. I&#8217;ve been playing with Node with the idea of having a server which can store tweets, or push out those from the last 20 minutes, using the <a title="Net tuts on node.js" href="http://net.tutsplus.com/tutorials/javascript-ajax/learning-serverside-javascript-with-node-js/" target="_blank">net.tuts tutorial</a> as a way of getting up to speed quickly.</p>
<p>As with storing the Open Correspondence data in CouchDB to prevent reparsing, I&#8217;m wondering if Node can be used as a server to get changing data from an interface as JSON, store it in CouchDB and then let a client know that it has done all that. It would allow for a framework to be continually updated from different data sources. Just an idea.</p>
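<p>The &#8220;last 20 minutes&#8221; part of the idea is independent of Node, so here is a rough sketch of it in Python. The class and method names are made up for illustration, the CouchDB step is left out entirely, and the clock is injectable only so that the behaviour can be checked without waiting.</p>

```python
import time
from collections import deque


class RecentStore:
    """Keep items for a fixed window and hand back whatever is still fresh."""

    def __init__(self, window_seconds=20 * 60, clock=time.time):
        self.window = window_seconds
        self.clock = clock      # any callable returning seconds
        self.items = deque()    # (timestamp, item) pairs, oldest first

    def add(self, item):
        """Store an item stamped with the current time."""
        self.items.append((self.clock(), item))
        self._expire()

    def recent(self):
        """Return the items that are still inside the window, oldest first."""
        self._expire()
        return [item for _, item in self.items]

    def _expire(self):
        # drop everything older than the window from the front of the queue
        cutoff = self.clock() - self.window
        while self.items and self.items[0][0] < cutoff:
            self.items.popleft()
```

<p>A server built around something like this would call add() as each tweet arrives and recent() whenever a client asks for the current window.</p>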
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/11/tweeting-changes-with-node-js/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Ubuntu, messaging and Open Correspondence</title>
		<link>http://austgate.co.uk/2010/08/weeknotes-ubuntu-messaging-and-open-correspondence/</link>
		<comments>http://austgate.co.uk/2010/08/weeknotes-ubuntu-messaging-and-open-correspondence/#comments</comments>
		<pubDate>Sun, 29 Aug 2010 11:04:40 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[messaging]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=201</guid>
		<description><![CDATA[It has been a while since the last weeknotes. I&#8217;ve finally made the move to Linux, or at least dual booting, by installing Ubuntu so I&#8217;m currently learning a little about the OS and getting a development environment set up for it. I&#8217;ve nearly finished the ongoing accounts project at work. The framework is up and [...]]]></description>
			<content:encoded><![CDATA[<p>It has been a while since the last weeknotes. I&#8217;ve finally made the move to Linux, or at least dual booting, by installing Ubuntu so I&#8217;m currently learning a little about the OS and getting a development environment set up for it.</p>
<p>I&#8217;ve nearly finished the ongoing accounts project at work. The framework is up and it went through testing over the last couple of weeks. There are a few rough edges and some bugs which still need fixing but it largely seems to be there now.</p>
<p>I&#8217;ve also installed the first part of a messaging server written in PHP (taking ideas and concepts from <a title="JMS page on Sun/Oracle site" href="http://www.oracle.com/technetwork/java/index-jsp-142945.html" target="_blank">JMS</a> and Python&#8217;s <a title="Python's routes" href="http://routes.groovie.org/" target="_blank">Routes</a> for service urls) which takes a message from the core CMS system and routes it to the correct service using SOA. If there&#8217;s an issue with the service then it logs it and queues the message using Redis (although an array might be quicker, I wanted the queue decoupled from the server in case it failed or had to be restarted and the memory was wiped). I need to finish up the worker to dequeue at certain points in time but I expect to get it finished in about four days once I&#8217;m back at work.</p>
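<p>The route, log and queue behaviour described above can be sketched in a few lines. This is in Python rather than PHP, an in-memory deque stands in for the Redis queue, and all of the names are hypothetical, so treat it as an outline of the pattern rather than the server itself.</p>

```python
from collections import deque


class Router:
    """Route messages to named services, queueing any that fail."""

    def __init__(self):
        self.services = {}     # service name -> callable taking a message
        self.queue = deque()   # stand-in for a Redis list
        self.log = []          # stand-in for a real logger

    def register(self, name, handler):
        self.services[name] = handler

    def dispatch(self, name, message):
        """Deliver a message; on failure, log it and park it for later."""
        try:
            self.services[name](message)
            return True
        except Exception as exc:
            self.log.append("%s failed: %s" % (name, exc))
            self.queue.append((name, message))
            return False

    def retry(self):
        """Worker step: try everything currently queued once more."""
        for _ in range(len(self.queue)):
            name, message = self.queue.popleft()
            self.dispatch(name, message)
```

<p>Keeping the queue behind its own small interface is what would let it live in Redis and survive a restart of the server, which an array inside the server process would not.</p>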
<p>I&#8217;ve done one or two things on the Open Correspondence site as well. I&#8217;ve tidied up the source XML and the sources XML as well to expose them so I need to update the site itself. The next thing I think we need to do is to start writing stuff to expose the underlying data and to show what you can do with the data. One of the things that I want to do is to write a function which I can put behind either <a title="Protovis toolkit" href="http://vis.stanford.edu/protovis/" target="_blank">Protovis</a> or the <a title="Javascript Infoviz Toolkit" href="http://thejit.org/" target="_blank">JavaScript InfoVis Toolkit</a> to convert the results of a SPARQL query into the relevant JSON, and I&#8217;m thinking of using Lee Feigenbaum&#8217;s <a title="sparql.js script" href="http://www.thefigtrees.net/lee/sw/sparql.js" target="_blank">sparql.js</a> script. Quite possibly I need to write some sort of API to the dataset to allow other queries to be run.</p>
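<p>As a rough idea of what such a conversion function might do, the sketch below takes results in the standard SPARQL JSON results format (a &#8220;results&#8221; object holding a list of &#8220;bindings&#8221;) and turns two of the variables into a nodes-and-links structure. It is written in Python here; the output shape and the function name are assumptions, since each visualisation toolkit expects its own layout.</p>

```python
def bindings_to_graph(results, source_var, target_var):
    """Turn SPARQL JSON results into a nodes-and-links structure."""
    nodes = []
    index = {}   # value -> position of its node in the nodes list
    links = []

    def node_id(value):
        # create a node the first time a value is seen
        if value not in index:
            index[value] = len(nodes)
            nodes.append({"name": value})
        return index[value]

    for row in results["results"]["bindings"]:
        # unbound variables are simply missing from a row
        if source_var in row and target_var in row:
            links.append({
                "source": node_id(row[source_var]["value"]),
                "target": node_id(row[target_var]["value"]),
            })
    return {"nodes": nodes, "links": links}
```

<p>Passing the result through json.dumps would give something a client-side script could load directly.</p>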
<p>My friend, <a title="Simon Biles' Linked In page" href="http://uk.linkedin.com/in/simonbiles" target="_blank">Simon Biles</a> who owns <a title="Thinking Security website" href="http://thinking-security.com/" target="_blank">Thinking Security</a>, and I have been talking about a Knowledge Management project which is slightly aligned with some stuff I&#8217;ve been thinking about for storing research pages from RSS feeds and web pages. He&#8217;s thinking in terms of MS Office documents, which means a little investigation into the various types of structured storage in Office, and the ways that Office has changed, in order to mine different types of documents. It does appear at first glance, though, that newer versions of Office and Open Office are similar when it comes to finding the metadata, as both store a document as a collection of XML documents in an archive.</p>
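<p>Because both formats are zip archives, a first pass at pulling out the metadata needs nothing beyond the Python standard library. The sketch below looks for docProps/core.xml (where Office Open XML files keep their core properties) and meta.xml (the OpenDocument equivalent) and flattens whatever it finds into a dictionary. The function name and the flattening are assumptions for illustration, and older binary Office files are not covered.</p>

```python
import zipfile
import xml.etree.ElementTree as ET

# Where each format keeps its document metadata inside the zip archive.
METADATA_PARTS = ("docProps/core.xml", "meta.xml")


def read_metadata(path_or_file):
    """Return {local tag name: text} for the metadata part of a document."""
    found = {}
    with zipfile.ZipFile(path_or_file) as archive:
        names = set(archive.namelist())
        for part in METADATA_PARTS:
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            for element in root.iter():
                text = (element.text or "").strip()
                if text:
                    # drop the "{namespace}" prefix that ElementTree adds
                    found[element.tag.rsplit("}", 1)[-1]] = text
    return found
```

<p>Dropping the namespaces loses some precision, but it is enough to see which fields (title, creator, dates and so on) a given document actually carries.</p>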
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/08/weeknotes-ubuntu-messaging-and-open-correspondence/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating bibliographic resources from web pages</title>
		<link>http://austgate.co.uk/2010/08/creating-bibliographic-resources-from-web-pages/</link>
		<comments>http://austgate.co.uk/2010/08/creating-bibliographic-resources-from-web-pages/#comments</comments>
		<pubDate>Sun, 15 Aug 2010 18:52:52 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[archiving]]></category>
		<category><![CDATA[warc]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=126</guid>
		<description><![CDATA[Given the increasingly digital nature of research, including not only websites but also blogs, forums and wikis, the (in my view) beloved moleskin is becoming increasingly outdated. I&#8217;ve just finished writing my first book and had the joy of using moleskin notebooks to note down urls and make notes. I like moleskins a lot but pen and [...]]]></description>
			<content:encoded><![CDATA[<p>Given the increasingly digital nature of research, including not only websites but also blogs, forums and wikis, the (in my view) beloved moleskin is becoming increasingly outdated.<br />
I&#8217;ve just finished writing my first book and had the joy of using moleskin notebooks to note down urls and make notes. I like moleskins a lot but pen and paper do have their limitations when searching. I also bookmarked pages but changing computers has lost a few of these.</p>
<p>I&#8217;m just starting the research on a new book and looking around for any open source / free software to capture a url, mark it with the time accessed (for later bibliographical purposes), capture the raw HTML, and possibly allow me to tag it for folksonomical reference if I want. What would be sort of cool is to have an interface to share the results later or just to export an XML / RDF file to be posted later.</p>
<p>I suppose what I essentially want to find is something along the lines of a moleskin for electronic notes. I can see various subscription services listed but I really want something on the desktop to create a relevant project archive to share later. Potentially this does add to the issue of lots of mini-silos by creating more but if, in <a title="Bibliographica website" href="http://bibliographica.org/" target="_blank">Bibliographica</a> style, they could be linked or linkable, I think it could be an interesting way of sharing research links or allowing bodies to create a meta-frame calling from the shared resources.</p>
<p>I think that this falls into the realm of archiving, which poses issues in the UK, especially when it concerns commercial sites as my reading of the consultation has it. Wired UK has an <a title="Wired UK on archiving websites" href="http://www.wired.co.uk/news/archive/2010-03/05/archiving-britain's-web-the-legal-nightmare-explored.aspx" target="_blank">article on the issues of archiving web sites</a> in Britain and the legal difficulties therein. The British Library has been working on an archive (including some from shops no longer extant) but can only archive the site if the copyright holder has given permission. Even <a title="Archive of PDF on digital archiving" href="http://webarchive.nationalarchives.gov.uk/+/http://www.culture.gov.uk/images/consultations/Digital_legal_deposit.pdf" target="_blank">the consultation paper</a> (itself archived now) is vague on this.</p>
<p>Ultimately this will hobble research if ways of noting and sharing the relevant data and metadata cannot be found. It would also mean that I&#8217;m left to the vagaries of my browser or to remembering to make a note of the link in a new moleskin.</p>
<p>Building something along the lines of what I want might create a tool which other people might find useful.</p>
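<p>As a starting point, the capture step described earlier (url, time accessed, raw HTML and optional tags, with an XML export for sharing) might look something like the Python sketch below. Everything in it is an assumption for illustration: the record fields, the function names and especially the XML layout, which is made up rather than any existing bibliographic format.</p>

```python
import datetime
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Capture:
    url: str
    accessed: str    # ISO 8601 timestamp, for the bibliography
    html: str        # raw page source as fetched
    tags: list = field(default_factory=list)


def capture(url, fetch, tags=(), now=None):
    """Record a page; fetch is any callable that returns the HTML for a url."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return Capture(url, now.isoformat(), fetch(url), list(tags))


def to_xml(captures):
    """Serialise captures to a simple, made-up XML layout for sharing."""
    root = ET.Element("captures")
    for item in captures:
        node = ET.SubElement(root, "capture", url=item.url, accessed=item.accessed)
        for tag in item.tags:
            ET.SubElement(node, "tag").text = tag
        ET.SubElement(node, "html").text = item.html
    return ET.tostring(root, encoding="unicode")
```

<p>Passing the fetching in as a function keeps the sketch free of network code; in practice it could be a thin wrapper around urllib or whatever the desktop tool ends up using.</p>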
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/08/creating-bibliographic-resources-from-web-pages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

