<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Aust Gate &#187; data sets</title>
	<atom:link href="http://austgate.co.uk/tags/data-sets/feed/" rel="self" type="application/rss+xml" />
	<link>http://austgate.co.uk</link>
	<description>Open Knowledge and Literature</description>
	<lastBuildDate>Tue, 08 May 2012 20:33:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Weeknotes: Storing and cleaning data</title>
		<link>http://austgate.co.uk/2011/06/weeknotes-storing-and-cleaning-data/</link>
		<comments>http://austgate.co.uk/2011/06/weeknotes-storing-and-cleaning-data/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 13:15:13 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[weeknotes]]></category>
		<category><![CDATA[data sets]]></category>
		<category><![CDATA[node]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=350</guid>
		<description><![CDATA[This week I have been soft launching a CRM system for the Janet project. I had hoped that any problems would be just user bugs, but the launch has highlighted some interesting data cleaning issues. These are going to be inherent in the exchange of data between two or more systems, especially when one is a long-standing, pre-existing one. This has [...]]]></description>
			<content:encoded><![CDATA[<p>This week I have been soft launching a CRM system for the Janet project. I had hoped that any problems would be just user bugs, but the launch has highlighted some interesting data cleaning issues. These are going to be inherent in the exchange of data between two or more systems, especially when one is a long-standing, pre-existing one.</p>
<p>This has long-term implications for ensuring that the data stays clean and standardised. That matters because one of the forthcoming projects is based on our technical documents and on converting them from existing formats (when these are fully confirmed) into a system that has yet to be designed or built. As part of this I&#8217;ve been looking at Chris Gutteridge&#8217;s <a title="Chris Gutteridge's Grinder" href="https://github.com/cgutteridge/Grinder" target="_blank">Grinder</a>, a parser for getting RDF data out of Excel and CSV files. I was reminded of Grinder whilst reading his article about Linked Data at the University of Southampton in the final ever Nodalities. Whilst Grinder itself may not be of initial use, it does give me some clues about the possibilities of transforming the data.</p>
<p>The project also forces me to think about how the programme would run, which I suspect will be from the command line. If this is a safe assumption, then it means that I need to get back to Perl or use Python. Much as I like PHP, I&#8217;m not sure it is a command line language. I know it can be run as one, but it always makes me nervous as I don&#8217;t really consider it a system administration or data munging language. In either case, Perl and Python mean another re-learning curve, especially Perl, which I last used at JISCMail a couple of years ago.</p>
<p>A side project that I&#8217;ve been looking at is the real-time storage of feed data for later mining and use. I&#8217;ve been thinking of using Node.js and Redis, running in the background (and of actually starting something!). A little side something, methinks. It does mean learning more about Node, though, and it gives me something tangible to build. I&#8217;ve been having a little search around the Net and came across an older post by <a title="Marshall Kirkpatrick on Realtime web" href="http://www.nten.org/blog/2009/10/28/ten-useful-examples-realtime-web-action" target="_blank">Marshall Kirkpatrick on the NTEN blog about realtime data</a> whilst reading about <a title="Elegant Code blogon node event loops" href="http://elegantcode.com/2010/11/19/taking-baby-steps-with-node-js-threads-vs-events/" target="_blank">event loops in Node on the Elegant Code</a> blog. Of course, once the data is stored, it must be processed to be useful, but that is the next step.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2011/06/weeknotes-storing-and-cleaning-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mining data driving the web?</title>
		<link>http://austgate.co.uk/2010/03/mining-data-driving-the-web/</link>
		<comments>http://austgate.co.uk/2010/03/mining-data-driving-the-web/#comments</comments>
		<pubDate>Wed, 17 Mar 2010 19:54:30 +0000</pubDate>
		<dc:creator>iain_emsley</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[data sets]]></category>

		<guid isPermaLink="false">http://austgate.co.uk/?p=128</guid>
		<description><![CDATA[Just seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the growth of data on the Web. He appears to argue that data and its uses will drive the Web soon, writing: the data age is less about the raw size of your data, and more about the cool stuff you can do [...]]]></description>
			<content:encoded><![CDATA[<p>I have just seen an article on TechCrunch by Bradford Cross of Flightcaster regarding the <a title="Bradford cross on data" href="http://techcrunch.com/2010/03/16/big-data-freedom/" target="_blank">growth of data</a> on the Web. He appears to argue that data and its uses will soon drive the Web, writing:</p>
<blockquote><p>the data age is less about the raw size of your data, and more about the  cool stuff you can do with it. Now that there is so much data, it is  time to unlock its value.</p></blockquote>
<p>It seems fairly straightforward given the lower barriers to growth and the tools now available to create and access data.</p>
<p>There are issues with this, such as learning how best to use these tools for the user and to gain the most benefit. It&#8217;ll certainly be an interesting time, and Cross identifies a few technologies and ideas which may or may not gain currency but will spark debate nonetheless.</p>
]]></content:encoded>
			<wfw:commentRss>http://austgate.co.uk/2010/03/mining-data-driving-the-web/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

