<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Aust Gate &#187; Information Retrieval</title>
	<atom:link href="http://austgate.co.uk/category/informationretrieval/feed/" rel="self" type="application/rss+xml" />
	<link>http://austgate.co.uk</link>
	<description>Open Knowledge and Literature</description>
	<lastBuildDate>Sun, 25 Jul 2010 15:19:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Finding a space for NoSQL</title>
		<link>http://austgate.co.uk/2010/07/187/</link>
		<comments>http://austgate.co.uk/2010/07/187/#comments</comments>
		<pubDate>Tue, 20 Jul 2010 19:11:26 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[nosql]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=187</guid>
		<description><![CDATA[ReadWriteWeb have a post on NoSQL (again?) by Audrey Watters which is a brief overview of the area.  The original post points to the Heroku blog, where Adam Wiggins outlines the uses of NoSQL. I&#8217;m not an expert by any means but use Redis on a daily basis with the Rediska PHP library. I remember having [...]]]></description>
			<content:encoded><![CDATA[<p><a title="ReadWriteWeb on NoSQL" href="http://www.readwriteweb.com/cloud/2010/07/cassandra-predicting-the-futur.php" target="_blank">ReadWriteWeb</a> have a post on NoSQL (again?) by Audrey Watters which is a brief overview of the area.  The original post points to the Heroku blog, where Adam Wiggins <a title="Heroku blog on NoSQL" href="http://blog.heroku.com/archives/2010/7/20/nosql/" target="_blank">outlines the uses of NoSQL</a>. I&#8217;m not an expert by any means but use Redis on a daily basis with the Rediska PHP library. I remember having an argument with the IT director when I originally proposed using Redis but I&#8217;m glad that the gamble has paid off. The caching system that uses it is now far more productive than the earlier version.</p>
<p>Our base database is MySQL, which I like a fair amount for what we do with it, but all I needed to do was cache some data. The scripts write a fair amount of data to the cache and then there is one read process to read the entire list before updating the main database. At least I know that the data has some sort of security. It is not a panacea or similar cure-all but it does have a place in development for certain jobs.</p>
<p>Best tool and all that?</p>
<p>I can understand why <a title="Cassandra, Twitter and NoSQL" href="http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html" target="_blank">Twitter are not using Cassandra</a> in the main service but are still using it for other projects.  For now. Systems and priorities change and perhaps it will happen in some way.</p>
<p>Despite its meteoric rise, NoSQL is not the answer to everything. It does have a useful place though.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/07/187/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Redis, PHP, mail and SOAP</title>
		<link>http://austgate.co.uk/2010/06/weeknotes-redis-php-mail-and-soap/</link>
		<comments>http://austgate.co.uk/2010/06/weeknotes-redis-php-mail-and-soap/#comments</comments>
		<pubDate>Sun, 06 Jun 2010 11:05:18 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[redis]]></category>
		<category><![CDATA[soap]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=164</guid>
		<description><![CDATA[I&#8217;ve spent some time writing a queueing library using Redis as a backend. I started with the notion that it would need to be a FIFO queue but didn&#8217;t want to only use the in-built parts of PHP as a stack using array_pop or array_push. Whilst it might be faster, it doesn&#8217;t allow for queue [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve spent some time writing a queueing library using Redis as a backend. I started with the notion that it would need to be a FIFO queue but didn&#8217;t want to only use the in-built parts of PHP as a stack using array_pop or array_push. Whilst it might be faster, it doesn&#8217;t allow for queue storage if the worker / router calling the queue does not run until a certain time so I looked at Redis. I drew some inspiration from <a title="MEMQ blog post" href="http://abhinavsingh.com/blog/2010/02/memq-fast-queue-implementation-using-memcached-and-php-only/" target="_blank">MEMQ</a>, a queue implementation using memcached. I wrote a quick set of functions to handle connection, enqueuing and dequeueing with the ever present Rediska as the underlying Redis connection library. I&#8217;m tempted to revisit this and to write my own connection to remove the reliance on Rediska. What I did learn was how to increase and decrease the number of items that could be dequeued. For some stupid reason, I&#8217;d got into my head that it would either be one or all items.</p>
<p>However if you think about the LRANGE command, you can read as many items as you want, drop them into an array and iterate across them (LRANGE only reads, so an LTRIM afterwards is what actually removes them from the list). I need to try this but you could feasibly call items from the middle of the list by changing the start and end points in LRANGE. Normally I&#8217;d do something like LRANGE &lt;list name&gt; 0 -1 for all items or LRANGE &lt;list name&gt; 0 1 for the first two, since both offsets are inclusive. If you know there are 30 items but only want 5 from position 20, then you could use LRANGE &lt;list name&gt; 20 24 to achieve the result. It is not really germane to the queueing that I&#8217;ve been looking at (for system updates where I need everything or just the first item) but could be a useful adaptation for somebody else.</p>
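<p>As a rough sketch of that range read (in Python rather than PHP, so that it runs without a Redis server), the two functions below mimic Redis&#8217;s LRANGE and an LRANGE followed by LTRIM on a plain list. The names lrange and dequeue are my own for illustration and are not part of Rediska or Redis itself.</p>
<pre><code>def lrange(items, start, stop):
    """Mimic Redis LRANGE on a Python list: both offsets are inclusive."""
    size = len(items)
    # Negative offsets count back from the tail of the list, as in Redis.
    if 0 > start:
        start = max(size + start, 0)
    if 0 > stop:
        stop = size + stop
    # Python slices exclude the end index, so add one to include stop.
    return items[start:stop + 1]


def dequeue(items, count):
    """Read the first count items, then trim them off (LRANGE then LTRIM)."""
    if 1 > count:
        return []
    batch = lrange(items, 0, count - 1)
    # Drop what was read so the list keeps only the remaining items.
    del items[:len(batch)]
    return batch


queue = ["job%d" % n for n in range(30)]
print(lrange(queue, 20, 24))  # five items starting at position 20
print(dequeue(queue, 2))      # the first two items, removed from the queue
print(len(queue))
</code></pre>
<p>Against a real server the read and the trim are two separate commands, so they would need wrapping in a transaction if more than one worker reads the queue.</p>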
<p>The main challenge this week has been reading Excel attachments from email. PHP&#8217;s <a title="PHP's imap functions" href="http://php.net/manual/en/book.imap.php" target="_blank">imap</a> library allows you to read the structure of an email but is curiously reticent about retrieving data if you have MIME parts. I spent the best part of a day and a half getting a script to iterate over an incoming email, filter the parts so that it just explored the attachments&#8217; MIME types and then retrieve any attachments, either from a flat structure or by iterating over each part before calling imap_fetchbody(). So far the fix appears to work and has allowed me to create a prototype mail service for receiving email data. It seems odd that in the era of web services financial data is still sent by insecure methods but we must accommodate.</p>
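<p>The same walk over the parts can be sketched in Python with the standard email package. This is only an illustration of the idea and not the PHP script itself: the list of Excel MIME types, the function name excel_attachments and the test message are all assumptions of mine.</p>
<pre><code>from email.message import EmailMessage

# MIME types that Excel attachments commonly arrive with (an assumed list).
EXCEL_TYPES = {
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def excel_attachments(message):
    """Return (filename, bytes) pairs for each Excel attachment in a message."""
    found = []
    # walk() visits the message itself and then every nested MIME part.
    for part in message.walk():
        if part.get_content_disposition() != "attachment":
            continue
        if part.get_content_type() in EXCEL_TYPES:
            found.append((part.get_filename(), part.get_payload(decode=True)))
    return found


# Build a small test message instead of fetching one from a mailbox.
mail = EmailMessage()
mail["Subject"] = "Daily figures"
mail.set_content("Spreadsheet attached.")
mail.add_attachment(b"fake spreadsheet bytes", maintype="application",
                    subtype="vnd.ms-excel", filename="figures.xls")
mail.add_attachment("plain notes", filename="notes.txt")

print(excel_attachments(mail))
</code></pre>
<p>Filtering on the disposition first and the MIME type second means the body text and any non-Excel attachments are skipped without being decoded.</p>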
<p>I&#8217;ve also been looking at PHP&#8217;s <a title="PHP's soap functions" href="http://php.net/manual/en/book.soap.php" target="_blank">SOAP</a> library to create a status update service which will probably utilise <a title="Wikipedia on Service Orientated Architecture" href="http://en.wikipedia.org/wiki/Service-oriented_architecture" target="_blank">Service Orientated Architecture</a> to create a stable, scalable service. Initially I created a <a title="W3 on WSDL" href="http://www.w3.org/TR/wsdl" target="_blank">WSDL</a> file using the <a title="Eclipse ide" href="http://www.eclipse.org/" target="_blank">Eclipse IDE</a> but that threw up all sorts of issues and I ended up using Zend&#8217;s WSDL generator tool running across the existing server. Must look into this but there might be a conflict in versions of WSDL as well as a first-time learning curve. I&#8217;m hoping to get the first version of the service up this week.</p>
<p>I suspect that this week is going to be spent completing the commission and service status services, as well as possibly doing some documentation, as it is beginning to pile up.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/06/weeknotes-redis-php-mail-and-soap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weeknotes: Data mining, XML and bibliographies</title>
		<link>http://austgate.co.uk/2010/05/weeknotes-data-mining-xml-and-bibliographies/</link>
		<comments>http://austgate.co.uk/2010/05/weeknotes-data-mining-xml-and-bibliographies/#comments</comments>
		<pubDate>Sun, 23 May 2010 10:57:25 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[open_bibliography]]></category>
		<category><![CDATA[open_correspondence]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=155</guid>
		<description><![CDATA[It seems to have been a week of frantic completion and refactoring. The first half was spent frantically converting html pages into PDFs using Verypdf&#8217;s HTMLtools server product. All in all the manual is very helpful and the test server could be set up quickly. It might have helped the other end if I&#8217;d [...]]]></description>
			<content:encoded><![CDATA[<p>It seems to have been a week of frantic completion and refactoring.</p>
<p>The first half was spent frantically converting HTML pages into PDFs using Verypdf&#8217;s <a title="VeryPDF htmltools command line manual" href="http://www.verypdf.com/htmltools/html-tools.html" target="_blank">HTMLtools</a> server product. All in all the manual is very helpful and the test server could be set up quickly. It might have helped the other end if I&#8217;d remembered to break the file up for printing but that turned out to be a 10 minute job to put back into production. The next task is to transfer it from the test server and onto the production one but that&#8217;ll need to wait for networking to tweak it a little.</p>
<p>I spent some time refactoring the call recordings archive. For some reason the archiving solution that I hacked up in November decided to start failing in March after it was changed. Despite being put back to its original state it never quite got back to working as it did. I&#8217;ve been trying to tweak it on and off but never found the time to complete it. I finally just made the time on Friday afternoon to look at it properly. I&#8217;d been thinking about item based filtering after reading the first chapter of Toby Segaran&#8217;s <a title="OReilly page for Programming Collective Intelligence" href="http://oreilly.com/catalog/9780596529321/" target="_blank">Programming Collective Intelligence</a>. (On the back of this, I think I&#8217;ll be buying his <a title="O'Reilly page for Beautiful Data" href="http://oreilly.com/catalog/9780596157128/" target="_blank">Beautiful Data</a> at some point.)  Although this is not really an intelligent programme as such, the techniques have shown some real promise in the hurried tests. Using a Redis datastore, the percentage of found recordings is way up. Fingers crossed for Monday morning when I can see what the scripts run over the weekend have done. I also spent some time simplifying the matching algorithm so that I didn&#8217;t have to account for so many edge cases when dealing with time.</p>
<p>It seems that we are approaching some sort of real-time status update system at work. I&#8217;ve sort of been arguing for this for a while to remove the bottlenecks of having each system dependent on another one. One of our suppliers is sending us XML data so I&#8217;ve been playing with XPath 1.0 to extract the relevant values (XPath 2.0 apparently isn&#8217;t directly supported by PHP; there might be a way of passing the data to Java but that adds unnecessary overhead). Anyhow the core is running but I still need to fully test it and add in security.</p>
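<p>To give a flavour of the sort of XPath 1.0 extraction involved, here is a minimal sketch in Python using the standard ElementTree module, which understands a subset of XPath. The element and attribute names (updates, job, status, reference) are invented for the example and are not the supplier&#8217;s real format.</p>
<pre><code>import xml.etree.ElementTree as ET

# Build a stand-in for the supplier feed; the element names are invented.
feed = ET.Element("updates")
for ref, state in [("A100", "complete"), ("A101", "pending"), ("A102", "complete")]:
    job = ET.SubElement(feed, "job", status=state)
    ET.SubElement(job, "reference").text = ref


def references_with_status(root, status):
    """Collect the reference of every job whose status attribute matches."""
    # An attribute predicate picks the jobs; the last step picks the child.
    path = "./job[@status='%s']/reference" % status
    return [node.text for node in root.findall(path)]


print(references_with_status(feed, "complete"))
</code></pre>
<p>The status value is pasted straight into the path here, so it should only ever come from a fixed list of known values and never from user input.</p>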
<p>I&#8217;ve also been asked to design and implement a queueing system for the main internal server. I&#8217;ve run up a quick high level overview but the detail still needs to be worked on. I&#8217;m pushing it back to June so that I can clear the decks of the older projects that are still on the board.</p>
<p>I had a chat with <a title="Jonathan Gray's blog" href="http://jonathangray.org/" target="_blank">Jonathan Gray</a>, a sound guy who does far too much, about digital humanities ideas. We&#8217;ve agreed to keep closer contact with each other about the area and to encourage each other into actually doing stuff (I have half a Moleskine of ideas &#8211; time for more code, less talk then). He proposed the <a title="Jonathan Gray on Bibliographica" href="http://austgate.co.uk/2010/01/bibliographica-open-bibliographic-sourcing-and-maintenance/" target="_blank">Bibliographica idea</a> in January and the team wrote <a title="Bibliographican entry on the blog" href="http://blog.okfn.org/2010/05/20/bibliographica-an-introduction/" target="_blank">a blog entry</a> for the Open Knowledge Foundation blog. It is an idea that I&#8217;m looking forward to playing with and trying to embed data from. (<a href="http://bibliographica.org/">http://bibliographica.org/</a>)</p>
<p>One of the things that I&#8217;ve been thinking about, though, is that increasingly when we do research, we store web pages, blog entries and so on. Whilst there is a way of recording these in a footnote (http://example.org accessed on &lt;insert date&gt; type thing), there does not appear to be a way of building a local archive of these with the relevant metadata for later retrieval. Don&#8217;t know about anybody else but I&#8217;ve got a fair few pages dotted around my hard drive for projects and I&#8217;d like a way of storing these properly and to be able to integrate them into bibliographies or research notes. I know that there is the WARC format (<a title="Library of Congress on WARC" href="http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml" target="_blank">Library of Congress</a> link and the <a title="WARC tools on Google code" href="http://code.google.com/p/warc-tools/" target="_blank">WARC tools</a> Google code project) to play with so I need to make time to do that.</p>
<p>I had a mini-hack on the Open Correspondence project last Sunday intending to update a couple of pages and got a little more done than that. The database needs rebuilding but the purl reference (<a title="Letter schema PURL" href="http://purl.org/letter" target="_blank">http://purl.org/letter</a>) now points to the schema. It is so close that I can&#8217;t wait to actually start hacking the data. Time to do the last little bits like tidy up the parser, use the weaving history API to embed a timeline and start using <a title="jena sourceforge archive" href="http://jena.sourceforge.net/" target="_blank">JENA</a>, <a title="ARC website" href="http://arc.semsol.org" target="_blank">ARC</a> and Chris Gutteridge&#8217;s <a title="Graphite rdf library" href="http://graphite.ecs.soton.ac.uk/" target="_blank">Graphite</a> library which worked out of the box (but as yet I haven&#8217;t used it for much).</p>
<p>Goals for this week are to finish the Open Correspondence bits, update the trac instance with the various &#8216;todo&#8217;s, write a blog post for the Open Knowledge Foundation for Open Correspondence, and do some major testing at work on various XML exports and imports. I should just be about caught up then. With any luck&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/05/weeknotes-data-mining-xml-and-bibliographies/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data curation in real time</title>
		<link>http://austgate.co.uk/2010/04/data-curation-in-real-time/</link>
		<comments>http://austgate.co.uk/2010/04/data-curation-in-real-time/#comments</comments>
		<pubDate>Thu, 01 Apr 2010 20:29:16 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=148</guid>
		<description><![CDATA[Robert Scoble&#8217;s blog has this intriguing post on real-time curation which has made me think. At the moment I&#8217;m working on curating and archiving gigabytes of information at work (and usually on ways of generating more data from the systems). Whilst this is not necessarily real time, I&#8217;d like it to be or at least [...]]]></description>
			<content:encoded><![CDATA[<p>Robert Scoble&#8217;s blog has this intriguing post on <a title="Realtime data curation" href="http://scobleizer.com/2010/03/27/the-seven-needs-of-real-time-curators/" target="_blank">real-time curation</a> which has made me think. At the moment I&#8217;m working on curating and archiving gigabytes of information at work (and usually on ways of generating more data from the systems). Whilst this is not necessarily real time, I&#8217;d like it to be or at least happening on the same day.</p>
<p>I think that Scoble identifies the major challenges for data &#8211; bundling and updating. Relevance comes from context and it&#8217;s easy to create the bundles of data but you need to actually make it relevant, allow users to find it rapidly or allow the data to create its own relevance. Thought provoking post.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/04/data-curation-in-real-time/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Textcamp announced</title>
		<link>http://austgate.co.uk/2010/03/textcamp-announced/</link>
		<comments>http://austgate.co.uk/2010/03/textcamp-announced/#comments</comments>
		<pubDate>Sun, 28 Mar 2010 11:16:22 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[textcamp]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=143</guid>
		<description><![CDATA[Had dinner with Rufus Pollock and Ben O&#8217;Steen on Monday in Oxford. As part of the discussions, the notion of Textcamp was raised and Ben has created the Textcamp website with an associated blog. It is a slightly bigger concept than I had had but the approach, I think, will allow the creation of a [...]]]></description>
			<content:encoded><![CDATA[<p>Had dinner with <a title="Rufus Pollock's website" href="http://www.rufuspollock.org/" target="_blank">Rufus Pollock</a> and <a title="Ben O'Steen's blog" href="http://oxfordrepo.blogspot.com/" target="_blank">Ben O&#8217;Steen</a> on Monday in Oxford. As part of the discussions, the notion of Textcamp was raised and Ben has created the <a title="Textcamp website" href="http://textcamp.org/" target="_blank">Textcamp website</a> with an associated <a title="Textcamp blog" href="http://blog.textcamp.org/" target="_blank">blog</a>. It is a slightly bigger concept than I had had but the approach, I think, will allow the creation of a wider community and a place to publicly follow up any ideas that get thrown up. I like the idea of hacking texts as well and it will be great to have a place to discuss ideas and to learn. Equally Ben&#8217;s post makes it clear that it should be friendly and helpful leading up to a Barcamp style event. It is slated to run in August or September. I can&#8217;t wait.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/03/textcamp-announced/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exporting and querying Dickens data</title>
		<link>http://austgate.co.uk/2010/03/exporting-data/</link>
		<comments>http://austgate.co.uk/2010/03/exporting-data/#comments</comments>
		<pubDate>Sun, 21 Mar 2010 12:15:35 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[charles dickens]]></category>
		<category><![CDATA[rdf]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=137</guid>
		<description><![CDATA[As a follow up to the posting regarding the proposed ontology, I&#8217;ve started to try and create a SPARQL endpoint. At some point soon, I want to use the new version of ARC as the version I&#8217;ve got here is a little out of date. After that the next thing should be to allow the [...]]]></description>
			<content:encoded><![CDATA[<p>As a follow up to the posting regarding the proposed ontology, I&#8217;ve started to try and create a <a title="Dickens SPARQL endpoint" href="http://austgate.co.uk/dickens/export.php?type=rdf&amp;author=Dickens" target="_blank">SPARQL endpoint</a>. At some point soon, I want to use the new version of <a title="ARC website" href="http://arc.semsol.org/" target="_blank">ARC</a> as the version I&#8217;ve got here is a little out of date. After that the next thing should be to allow the endpoint to be converted into other forms like JSON.</p>
<p>UPDATE: I&#8217;ve created an endpoint using the default ARC settings here: <a title="RDF endpoint for Dickens project" href="http://austgate.co.uk/dickens/endpoint.php" target="_blank">http://austgate.co.uk/dickens/endpoint.php</a></p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/03/exporting-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Growing and using data</title>
		<link>http://austgate.co.uk/2010/03/growing-and-using-data/</link>
		<comments>http://austgate.co.uk/2010/03/growing-and-using-data/#comments</comments>
		<pubDate>Wed, 17 Mar 2010 19:57:01 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[data mining]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=130</guid>
		<description><![CDATA[Just seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the growth of data on the Web. He appears to argue that data and its uses will drive the Web soon, writing: the data age is less about the raw size of your data, and more about the cool stuff you can do [...]]]></description>
			<content:encoded><![CDATA[<p>Just seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the <a title="Bradford cross on data" href="http://techcrunch.com/2010/03/16/big-data-freedom/" target="_blank">growth of data</a> on the Web. He appears to argue that data and its uses will drive the Web soon, writing:</p>
<blockquote><p>the data age is less about the raw size of your data, and more about the  cool stuff you can do with it. Now that there is so much data, it is  time to unlock its value.</p></blockquote>
<p>It seems fairly straightforward given the lower barriers to growth and tools to create and access data.</p>
<p>There are issues with this such as learning how to best leverage these for the user and to gain most benefit. It&#8217;ll certainly be an interesting time and Cross identifies a few technologies and ideas which may or may not gain currency but will spark debate nonetheless.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/03/growing-and-using-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mining data driving the web?</title>
		<link>http://austgate.co.uk/2010/03/mining-data-driving-the-web/</link>
		<comments>http://austgate.co.uk/2010/03/mining-data-driving-the-web/#comments</comments>
		<pubDate>Wed, 17 Mar 2010 19:54:30 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[data sets]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=128</guid>
		<description><![CDATA[Just seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the growth of data on the Web. He appears to argue that data and its uses will drive the Web soon, writing: the data age is less about the raw size of your data, and more about the cool stuff you can do [...]]]></description>
			<content:encoded><![CDATA[<p>Just seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the <a title="Bradford cross on data" href="http://techcrunch.com/2010/03/16/big-data-freedom/" target="_blank">growth of data</a> on the Web. He appears to argue that data and its uses will drive the Web soon, writing:</p>
<blockquote><p>the data age is less about the raw size of your data, and more about the  cool stuff you can do with it. Now that there is so much data, it is  time to unlock its value.</p></blockquote>
<p>It seems fairly straightforward given the lower barriers to growth and tools to create and access data.</p>
<p>There are issues with this such as learning how to best leverage these for the user and to gain most benefit. It&#8217;ll certainly be an interesting time and Cross identifies a few technologies and ideas which may or may not gain currency but will spark debate nonetheless.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/03/mining-data-driving-the-web/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bibliographica &#8211; open bibliographic sourcing and maintenance</title>
		<link>http://austgate.co.uk/2010/01/bibliographica-open-bibliographic-sourcing-and-maintenance/</link>
		<comments>http://austgate.co.uk/2010/01/bibliographica-open-bibliographic-sourcing-and-maintenance/#comments</comments>
		<pubDate>Sun, 24 Jan 2010 11:37:20 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Open Knowledge]]></category>
		<category><![CDATA[open_bibliography]]></category>
		<category><![CDATA[open_service]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=113</guid>
		<description><![CDATA[Jonathan Gray of the Open Knowledge Foundation has a thought provoking post on the need for an Open Bibliographic Service which he calls Bibliographica. As he writes: lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry [...]]]></description>
			<content:encoded><![CDATA[<p>Jonathan Gray of the Open Knowledge Foundation has a thought provoking post on the need for an Open Bibliographic Service which he calls <a title="Jonathan Gray on Bibliographica" href="http://jonathangray.org/2010/01/22/bibliographica/" target="_blank">Bibliographica</a>. As he writes:</p>
<blockquote><p>lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing. Books and articles are published and translated all the time. Works fall in and out of fashion. ‘Secondary’ reference works can become obsolete &#8211; considered interesting more for what they say about a particular intellectual period than what they say about their subject matter.</p></blockquote>
<p>I&#8217;ve been working on my own book as an independent researcher and wanted to know common books and articles in the area. As a user I wanted to know what was published in a particular area and what the points of commonality are to identify key works. Jonathan&#8217;s idea would be a help for this and, perhaps more importantly, provide a shared platform.</p>
<p>As he identifies, sites like Amazon and LibraryThing allow for the user to create lists of books but over time, fashions change and books fall into and out of favour. Being able to compile searchable, sortable lists would allow the user to develop comprehensive lists (and also allow the intellectual historian to figure out zeitgeists from lists) and also realise the web&#8217;s potential for knowledge sharing which should go beyond mere surfing and into finding the source material and perhaps surprising links between data sets.</p>
<p>His specification, I think, offers a fertile starting point. It appears to source from and link to existing sources rather than re-invent the wheel and to also use existing technologies and ontologies like <a title="MARC website" href="http://www.loc.gov/marc/" target="_blank">MARC</a> and <a title="Dublin Core" href="http://dublincore.org/" target="_blank">Dublin Core</a>. I think that the specification is also sensible in its identification of users and groups to create and edit lists. It mentions that the service could be run by individual universities but what would be extremely useful (but perhaps would not happen) is if these silos could then link to each other via interfaces to create continually updated communal resources rather than being individual silos.</p>
<p>Perhaps this is a slightly off topic thought but I&#8217;d love to know which books referred to each other, so that we could examine whether Foo writing Bar read the book by Baz which would be an indicator of influence.</p>
<p>The Bibliographica idea mixes &#8220;traditional&#8221; scholarship with crowd sourcing and is a sensible and potentially useful idea and service. I think it would need to build a critical mass of data and sources to be really useful but it could encourage use of resources.</p>
<p>UPDATE: Just one of those thoughts I had whilst making some lemon tea. Actually one of the challenges would be normalising the data sources to update the sources and pull in from the external sources.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/01/bibliographica-open-bibliographic-sourcing-and-maintenance/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Full text search using PHP and MySQL</title>
		<link>http://austgate.co.uk/2009/12/full-text-search-using-php-and-mysql/</link>
		<comments>http://austgate.co.uk/2009/12/full-text-search-using-php-and-mysql/#comments</comments>
		<pubDate>Tue, 29 Dec 2009 19:38:19 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=106</guid>
		<description><![CDATA[I&#8217;ve been thinking about full text searching for the letters project and trying to find various solutions that are open source. On the Open Shakespeare and Open Milton sites, we used the Xapian  project which is an excellent search engine. However I wanted to try and find a way of getting a search running using [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been thinking about full text searching for the letters project and trying to find various solutions that are open source. On the <a title="The Open Shakespeare site" href="http://openshakespeare.org/" target="_blank">Open Shakespeare</a> and <a title="Open Milton site" href="http://openmilton.org/" target="_blank">Open Milton</a> sites, we used the Xapian project which is an excellent search engine. However I wanted to try and find a way of getting a search running using PHP and MySQL which is what the site uses at the moment although I&#8217;d be happy to also use Perl. (I also wanted to limit myself to technologies that I currently use at my job.)</p>
<p>I started with reading an <a title="Zend article on full text searching with PHP and MySQL" href="http://devzone.zend.com/article/1304" target="_blank">article on the Zend</a> site that offers an overview of setting up a table to run with a <a title="MySQL manual on full text searching" href="http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html" target="_blank">Full-Text</a> index. As the article mentions, you have to ensure that the column which is being searched is either VARCHAR or TEXT, as MySQL only builds Full-Text indexes on character columns (CHAR, VARCHAR or TEXT). If it is not in either form, then just alter the column using</p>
<blockquote><p>ALTER TABLE &lt;tablename&gt; MODIFY &lt;column&gt; TEXT</p></blockquote>
<p>(or VARCHAR but TEXT is probably preferable). What the Zend article does not mention is that the table type needs to be MyISAM rather than InnoDB (which means that transactions won&#8217;t work on the table). Having made the alteration, I ran the query:</p>
<blockquote>
<div>SELECT * , MATCH (&lt;column&gt;) AGAINST (&#8216;&lt;search term&gt;&#8217;) AS score FROM &lt;table&gt; WHERE MATCH (&lt;column&gt;) AGAINST (&#8216;&lt;search term&gt;&#8217;)</div>
</blockquote>
<p>The query returns all the columns along with a relevance score against the term.</p>
<p>The SQL code just needs calling as you would any other form of database code. I&#8217;m still playing with this but I&#8217;ve been ordering the table by the score descending (ORDER BY score DESC) so that the most relevant results are posted for the user.</p>
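<p>Pulling the query and the ordering together, a minimal sketch of a query builder might look like the one below. It is in Python rather than the PHP the site uses, and the function name fulltext_query, the table name letters and the column name body are made up for the example. The %s placeholders assume a driver that uses that parameter style, so adjust to whatever your database library expects.</p>
<pre><code>import re

# Only plain identifiers are allowed for the table and column names.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def fulltext_query(table, column, term):
    """Return (sql, params) for a relevance-ordered MySQL full-text search."""
    # Table and column names cannot be bound as parameters, so check them first.
    for name in (table, column):
        if IDENTIFIER.fullmatch(name) is None:
            raise ValueError("unsafe identifier: %r" % name)
    sql = (
        "SELECT *, MATCH (%s) AGAINST (%%s) AS score "
        "FROM %s WHERE MATCH (%s) AGAINST (%%s) "
        "ORDER BY score DESC"
    ) % (column, table, column)
    # The search term is passed twice, once for each AGAINST placeholder.
    return sql, (term, term)


sql, params = fulltext_query("letters", "body", "Chapman")
print(sql)
print(params)
</code></pre>
<p>Keeping the search term out of the SQL string and passing it as a parameter avoids having to escape whatever the user types into the search box.</p>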
<p>I do think that I need to do some pre-processing on my own results set to highlight relevance and to extract further semantic meanings for results. For example, a search for the publisher &#8216;Chapman and Hall&#8217; that I could run on the Dickens letters (<a title="Dickens letters search example" href="http://austgate.co.uk/dickens/search.php?term=Chapman&amp;submit=Submit+Query" target="_blank">http://austgate.co.uk/dickens/search.php?term=Chapman&amp;submit=Submit+Query</a>) could equally pull up other businesses or people. I still need to write a parser that can make some sort of judgement even if it is a guess.</p>
<p>I&#8217;m sure as I carry on developing the engine and bringing everything together for the project, I&#8217;ll have further thoughts on the creation of an engine and creating a more advanced version. This does at least give me a start using current tools (though it is perhaps not as good as Xapian but sometimes you have to at least learn some of the basics).</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2009/12/full-text-search-using-php-and-mysql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
