Weeknotes: Storing and cleaning data

This week has been spent soft-launching a CRM system for the Janet project. Hopefully the problems so far are just user bugs, but the launch has highlighted some interesting data cleaning issues. These are inherent in the exchange of data between two or more systems, especially when one of them is a long-established one.

This has long-term implications for ensuring that the data stays clean and standardised, particularly as one of the forthcoming projects is based on our technical documents and involves converting them from their existing formats (when these are fully confirmed) into a system that has yet to be designed or built. As part of this I’ve been looking at Chris Gutteridge’s Grinder, a parser for getting RDF data out of Excel and CSV files. I was reminded of Grinder whilst reading his article about Linked Data at the University of Southampton in the final ever Nodalities. Whilst Grinder itself may not be of initial use, it does give me some clues about the possibilities of transforming the data.

The project also forces me to think about how the programme would run, and I suspect it will be from the command line. If this is a safe assumption, then it means that I need to get back to Perl or use Python. Much as I like PHP, I’m not sure it is a command line language. I know it can be run as one but it always makes me nervous, as I don’t really consider it a system administration or data munging language. In either case, Perl and Python mean another re-learning curve, especially Perl, which I last used at JISCMail a couple of years ago.
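To give a feel for what command-line data munging in Python might look like, here is a minimal sketch using only the standard library. It is not the project's actual converter (that is not designed yet); the function names `clean_row`, `clean_csv` and `main` are my own placeholders, and the only cleaning rule assumed is trimming and collapsing stray whitespace in headers and fields.

```python
import csv
import sys


def clean_row(row):
    """Trim each field and collapse internal runs of whitespace to one space."""
    return {
        key.strip(): " ".join((value or "").split())
        for key, value in row.items()
        if key is not None  # ignore surplus cells that have no header
    }


def clean_csv(source, target):
    """Read CSV from `source`, write cleaned rows to `target`, return the row count."""
    reader = csv.DictReader(source)
    fieldnames = [name.strip() for name in (reader.fieldnames or [])]
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(clean_row(row))
        count += 1
    return count


def main():
    """Hypothetical entry point: act as a filter from stdin to stdout."""
    clean_csv(sys.stdin, sys.stdout)
```

Called via `main()`, this would be used as a filter, along the lines of `python clean.py < export.csv > cleaned.csv` (the file names are made up for illustration). Real cleaning rules for the CRM or document data would need to be added once the formats are confirmed.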

A side project that I’ve been looking at is the real-time data storage of feeds for later mining and use. I’ve been thinking of using Node.js (and actually starting something!) and Redis to run in the background. A little side something, methinks. It does mean me learning more about Node though and gives me something tangible to build. I’ve been having a little search around the Net and came across an older post by Marshall Kirkpatrick on the NTEN blog about realtime data whilst reading about event loops in Node on the Elegant Code blog. Of course, once it is stored, it must be processed to be useful but that is the next step.