Archive for the ‘projects’ Category

Weeknotes: Open correspondence

Sunday, September 5th, 2010

A quiet week as I’ve been having a few days off but I’ve been working on some of the tickets for Open Correspondence.

The URLs have changed to /letters/view/<author>/<correspondent>/<letter id> in an attempt to make them more user friendly and also to allow the user to define smaller or larger collections by altering the URL. There is also some basic content negotiation to expose the data in JSON, XML and RDF as well as HTML.
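As a rough illustration of how that URL scheme and the content negotiation could fit together, here is a minimal Python sketch. Only the URL pattern and the four formats come from the notes above; the function names, the media types and the idea that dropping trailing segments widens the collection are assumptions, not the site's actual code.

```python
from typing import NamedTuple, Optional

# Media types mapped to the four formats mentioned above.
# The exact media types the site answers to are an assumption.
FORMATS = {
    "text/html": "html",
    "application/json": "json",
    "application/xml": "xml",
    "application/rdf+xml": "rdf",
}


class LetterRequest(NamedTuple):
    author: str
    correspondent: Optional[str]
    letter_id: Optional[str]


def parse_letter_path(path: str) -> LetterRequest:
    """Split /letters/view/<author>[/<correspondent>[/<letter id>]].

    Assumption: leaving off trailing segments asks for a larger collection.
    """
    parts = [p for p in path.split("/") if p]
    if parts[:2] != ["letters", "view"] or not 3 <= len(parts) <= 5:
        raise ValueError("not a letters URL: %r" % path)
    # Pad with None so missing segments are explicit.
    author, correspondent, letter_id = (parts[2:] + [None, None])[:3]
    return LetterRequest(author, correspondent, letter_id)


def choose_format(accept_header: str) -> str:
    """Pick the first supported media type listed in an Accept header.

    Deliberately naive: q-values are ignored and HTML is the fallback.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in FORMATS:
            return FORMATS[media_type]
    return "html"
```

A request for /letters/view/dickens with an Accept header of application/rdf+xml would then resolve to the whole of one author's letters, serialised as RDF.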

I’ve been trying to use the linked data API specs to make sure that the site operates as a service as well. Multi-tasking really.

One of the next things to do is to develop graphical features for the correspondent pages and to sort out the RDF with rdflib. Think I’ll try to do that next week though.

Weeknotes: Ubuntu, messaging and Open Correspondence

Sunday, August 29th, 2010

It has been a while since the last weeknotes. I’ve finally made the move to Linux, or at least dual booting, by installing Ubuntu so I’m currently learning a little about the OS and getting a development environment set up for it.

I’ve nearly finished the ongoing accounts project at work. The framework is up and it went through testing over the last couple of weeks. There are a few rough edges and some bugs which still need fixing but it largely seems to be there now.

I’ve also installed the first part of a messaging server written in PHP (taking ideas and concepts from JMS and Python’s Routes for service URLs) which takes a message from the core CMS system and routes it to the correct service using SOA. If there’s an issue with the service then it logs it and queues the message using Redis (although an array might be quicker, I wanted the queue decoupled from the server in case it failed or had to be restarted and the memory was wiped). I need to finish up the worker to dequeue at certain points in time but I expect to get it finished in about four days once I’m back at work.

I’ve done one or two things on the Open Correspondence site as well. I’ve tidied up the source XML and the sources XML as well to expose them so I need to update the site itself. The next thing I think we need to do is to start writing stuff to expose the underlying data and to show what you can do with the data. One of the things that I want to do is to write a function which I can put behind either Protovis or Javascript Infovis Toolkit to convert a SPARQL query into the relevant JSON and I’m thinking of using Lee Feigenbaum’s sparql.js script. Quite possibly I need to write some sort of API to the dataset to allow other queries to be run.
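The conversion step could look something like the sketch below. It is in Python rather than JavaScript, purely for illustration, and it assumes the endpoint returns the standard W3C SPARQL JSON results format; the function name is made up here and has nothing to do with sparql.js.

```python
import json


def bindings_to_rows(sparql_json: str) -> list:
    """Flatten SPARQL JSON results into a list of plain dicts.

    Each row maps a variable name to its string value, which is the
    shape most charting libraries want as input.
    """
    data = json.loads(sparql_json)
    variables = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append(
            {name: binding[name]["value"] if name in binding else None
             for name in variables}
        )
    return rows
```

The type and datatype information carried by each binding is thrown away here, which is fine for a quick chart but would need keeping for anything that cares about dates or numbers.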

My friend, Simon Biles who owns Thinking Security, and I have been talking about a Knowledge Management project which is slightly aligned with some stuff I’ve been thinking about for storing research pages from RSS feeds and web pages. He’s thinking in terms of MS Office documents which means a little investigation into the various types of structured storage in Office, and the ways that Office has changed, in order to mine different types of documents. It does appear at first glance though that newer versions of Office and Open Office are similar in terms of finding the metadata, both being collections of XML documents in an archive.
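A minimal sketch of what that similarity allows, assuming the usual layout of the two formats: newer Office files keep their core properties in docProps/core.xml and OpenDocument files in meta.xml, and both use Dublin Core elements for things like title and creator. The function name is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used by both formats for title, creator and so on.
DC = "{http://purl.org/dc/elements/1.1/}"

# Where each family keeps its metadata inside the zip archive.
METADATA_PARTS = ("docProps/core.xml", "meta.xml")


def read_dc_metadata(archive_bytes: bytes) -> dict:
    """Return the Dublin Core fields found in an Office or ODF archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        names = set(archive.namelist())
        for part in METADATA_PARTS:
            if part in names:
                root = ET.fromstring(archive.read(part))
                break
        else:
            # Neither metadata part is present.
            return {}
    found = {}
    for element in root.iter():
        if element.tag.startswith(DC) and element.text:
            found[element.tag[len(DC):]] = element.text.strip()
    return found
```

Older binary Office files (the pre-XML structured storage formats) are not zip archives, so they would need a different reader entirely.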

Creating bibliographic resources from web pages

Sunday, August 15th, 2010

Given the increasingly digital nature of research, including not only websites but blogs, forums and wikis, the (in my view) beloved moleskin is becoming increasingly outdated.
I’ve just finished writing my first book and had the joy of using moleskin notebooks to note down urls and make notes. I like moleskins a lot but pen and paper does have its limitations when searching. I also bookmarked pages but changing computers has lost a few of these.

I’m just starting the research on a new book and looking around for any open source / free software to capture a URL, mark it with the time accessed (for later bibliographical purposes), capture the raw HTML, and possibly allow me to tag it for folksonomical reference if I want. What would be sort of cool is to have an interface to share the results later or just export an XML / RDF file to be posted later.
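To make the wish list concrete, here is a minimal sketch of the record such a tool might keep. Fetching the page is left out on purpose; the class, field and function names are all hypothetical, and JSON stands in for the XML / RDF export.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CapturedPage:
    url: str
    raw_html: str
    accessed: str  # ISO 8601 timestamp, for the "accessed on" citation
    tags: list = field(default_factory=list)


def capture(url, raw_html, tags=(), now=None):
    """Build a record for a page that has already been downloaded."""
    moment = now or datetime.now(timezone.utc)
    # Tags are de-duplicated and sorted so records compare cleanly.
    return CapturedPage(url, raw_html, moment.isoformat(), sorted(set(tags)))


def to_json(page):
    """Serialise a record so it can be shared or archived later."""
    return json.dumps(asdict(page), sort_keys=True)
```

Keeping the access time in the record itself means a footnote can be generated later without relying on file timestamps.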

I suppose what I essentially want to find is something along the lines of a moleskin for electronic notes? I can see various subscription services listed but I really want something on the desktop to create a relevant project archive to share later. Potentially this does add to the issue of lots of mini-silos by creating more but if, in Bibliographica style, they could be linked or linkable, I think it could be an interesting way of sharing research links or allowing bodies to create a meta-frame calling from the shared resources.

I think that this falls into the realm of archiving, which poses issues in the UK, especially when it concerns commercial sites as my reading of the consultation has it. Wired UK has an article on the issues of archiving web sites in Britain and the legal difficulties therein. The British Library has been working on an archive (including some from shops no longer extant) but can only archive the site if the copyright holder has given permission. Even the consultation paper (itself archived now) is vague on this.

Ultimately this will hobble research if ways of noting and sharing the relevant data and metadata cannot be found to allow sharing and relevant notation. It would also mean that I’m left to the vagaries of my browser or remembering to make a note of the link in a new moleskin.

Building something along the lines of what I want might create a tool which other people might find useful.

Weeknotes: Talks, Open Correspondence, XMPP

Sunday, July 25th, 2010

I gave a talk at the Oxford Geek Nights about Open Correspondence and letters. At some point I really ought to learn how to give talks. Anyhow Russell Davies was the main speaker and he showed how you could make physical objects from data derived from social networks. (He has a marvellously sane post about the Raoul Moat Facebook page.) Anyhow it’s gathered some people who are interested in contributing. Now I’ve finished the book, I’ve got more time to make changes to the codebase which urgently needs it. Finishing off stuff really. Then making the real changes.

Accounts has been slightly on hold since the wages needed to be run and I didn’t think that accounts or operations would be happy with figures potentially changing.

The main project has been setting up a notification service to set up the service layer correctly. I’ve finally got the server working so I’m just building a framework. I thought of porting parts of the djabberd project into PHP but for now I’m just looking at parts of it. XMPP is certainly a useful tool for getting machines to speak to each other and for developing event driven services.

Weeknotes: All quiet on the accounting front

Sunday, June 27th, 2010

It’s been a week of relative frustration with priorities suddenly being shifted and the infrastructure road map looking more and more unclear.

The SOAP server is largely debugged and ready for more extensive testing on the server and the back end has now been rewritten to capture more data. I cannot help feeling that it will need to change again to scale more efficiently once more services go online but right now I don’t have the expertise to do it. I’ll get there.

On a different tack, I’m back on the accounting project that I was on several months ago and making some headway in that. It’s grown since I was last involved in it but nothing that a decent set of specs and roadmaps cannot solve in terms of making it manageable.

I’ve been thinking about my next book project which is on the New Weird and genre over the last 15 years and wondering how to use dbpedia’s influencedBy and influence terms to show how writers influence each other over a century. I’m tempted to put the data into a large RDF sheet and then use JavaScript or PHP to transform it into JSON to see if you can use the Simile timeline software usefully or if I need to find / write something more appropriate. It does have to wait for me to finish the current book.

I forgot to link to the Open Correspondence blog post on the Open Knowledge Foundation’s blog which was posted a few days ago.

Weeknotes: Data, service buses and trac

Sunday, June 13th, 2010

I’ve succumbed and I’ve got a microslot at the next Oxford Geek Nights where I’m talking about the Open Correspondence website. I’ve downloaded the rest of the Gutenberg copies of the Dickens letters but just need an evening to make some headway with transforming them.

I spent a fair amount of this week trying to get a status update server built using SOAP and PHP which has been an ‘interesting’ task but seems to have finally got there. Having done some debugging at home on Friday, I’ve got to test the whole thing on Monday on the test server.

I’ve also been debugging the CSV uploads into the database and refactoring the code so that there is more re-use of similar objects. On top of that I started the documentation for the services and realised that I’d written most of the upload service for invoices as well. Bonus… So all I need do really is to spend a couple of days finishing things off at work so that the first versions of the services can go out.

Whilst doing all of that though, I realised that the queueing system that I was working on was only part of a solution to get all of our services working together. Instead of just queuing, I need to start thinking more along the lines of an enterprise service bus. So that’ll keep me busy then for a couple of weeks. My notebook has various notes and doodles, much to the enjoyment of my boss who thinks it’s all old-fashioned.

I’ve also started putting together a Trac instance for work to see if it scales and helps with ticketing and information across our department’s groups. It’ll probably be sidelined for this week whilst I try and get everything put together again with regards to the data uploads.

Weeknotes: Redis, PHP, mail and SOAP

Sunday, June 6th, 2010

I’ve spent some time writing a queueing library using Redis as a backend. I started with the notion that it would need to be a FIFO queue but didn’t want to only use the in-built parts of PHP as a stack using array_pop or array_push. Whilst it might be faster, it doesn’t allow for queue storage if the worker / router calling the queue does not run until a certain time so I looked at Redis. I drew some inspiration from MEMQ, a queue implementation using memcached. I wrote a quick set of functions to handle connection, enqueuing and dequeueing with the ever present Rediska as the underlying Redis connection library. I’m tempted to revisit this and to write my own connection to remove the reliance on Rediska. What I did learn was how to increase and decrease the number of items that could be dequeued. For some stupid reason, I’d got into my head that it would either be one or all items.

However if you think about the LRANGE command (LLEN only returns the length of the list), you can read as many items as you want, drop them into an array and iterate across them. I need to try this but you could feasibly call items from the middle of the list by changing the start and stop points in LRANGE. Normally I’d do something like LRANGE <list name> 0 -1 for all items or LRANGE <list name> 0 1 for the first two (both indices are inclusive), but if you know there are 30 items and only want 5 starting from position 20 then LRANGE <list name> 20 24 would achieve the result. LRANGE only reads the items, so to actually dequeue them the list needs trimming afterwards with LTRIM. It is not really germane to the queueing that I’ve been looking at (for system updates where I need everything or just the first item) but could be a useful adaptation for somebody else.
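A minimal sketch of batch reads from a list-backed queue follows, in Python rather than PHP. It assumes the range-reading command is LRANGE, which takes inclusive start and stop indices (LLEN only reports the length). The ListStore class is an in-memory stand-in written for this example so that it runs without a server; its method names mirror the rpush, llen, lrange and ltrim calls of the redis-py client, and the two helper functions are hypothetical.

```python
class ListStore:
    """In-memory stand-in for the few Redis list commands used here."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, *values):
        self._lists.setdefault(key, []).extend(values)
        return len(self._lists[key])

    def llen(self, key):
        return len(self._lists.get(key, []))

    def lrange(self, key, start, stop):
        """Return items from start to stop, both inclusive, like LRANGE."""
        items = self._lists.get(key, [])
        size = len(items)
        if start < 0:
            start = max(size + start, 0)
        if stop < 0:
            stop = size + stop
            if stop < 0:
                return []
        return items[start:stop + 1]

    def ltrim(self, key, start, stop):
        """Keep only the items from start to stop, like LTRIM."""
        self._lists[key] = self.lrange(key, start, stop)


def peek_window(store, key, start, count):
    """Read count items beginning at position start, without removing them."""
    if count <= 0:
        return []
    return store.lrange(key, start, start + count - 1)


def dequeue_batch(store, key, count):
    """Take up to count items off the front of the queue (FIFO order)."""
    items = peek_window(store, key, 0, count)
    # Drop what was just read; everything after it stays queued.
    store.ltrim(key, len(items), -1)
    return items
```

Against a real server the read and the trim are two separate commands, so with more than one worker they would need wrapping in a MULTI/EXEC transaction to stop two workers taking the same items.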

The main challenge this week has been reading Excel attachments from email. PHP’s imap library allows you to read the structure of an email but is curiously reticent in retrieving data if you have MIME parts. I spent the best part of a day and a half getting a script to iterate over an incoming email, filter the parts so that it just explored the attachment’s MIME type and then retrieve any attachments either from a flat structure or by iterating over each part before calling imap_fetchbody(). So far the fix appears to work and has allowed me to create a prototype mail service for receiving email data. It seems odd that in the era of web services financial data is still sent by insecure methods but we must accommodate.
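The same walk over the MIME parts can be sketched with Python’s standard email package. This is not the PHP script described above, just an illustration of the technique; the function name and the set of media types treated as Excel are assumptions.

```python
from email import message_from_bytes, policy

# Media types treated as Excel attachments (an assumption for this sketch).
EXCEL_TYPES = {
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def extract_attachments(raw_message: bytes, wanted_types=EXCEL_TYPES):
    """Return (filename, bytes) pairs for the matching attachments.

    walk() visits every part however deeply it is nested, so flat and
    multipart messages are handled the same way.
    """
    message = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_disposition() != "attachment":
            continue
        if part.get_content_type() not in wanted_types:
            continue
        # decode=True undoes the base64 / quoted-printable transfer encoding.
        found.append((part.get_filename(), part.get_payload(decode=True)))
    return found
```

Filtering on the declared media type is only as reliable as the sender’s mail client, so a production version would probably check the file extension or contents as well.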

I’ve also been looking at PHP’s SOAP library to create a status update service which will probably utilise Service Orientated Architecture to create a stable, scalable service. Initially I created a WSDL file using the Eclipse IDE but that threw up all sorts of issues and I ended up using Zend’s WSDL generator tool running across the existing server. Must look into this but there might be a conflict in versions of WSDL as well as a first time learning curve. I’m hoping to get the first version of the service up this week.

I suspect that this week is going to be about completing the commission and service status services as well as possibly doing some documentation as it is beginning to pile up.

Weeknotes: Data mining, XML and bibliographies

Sunday, May 23rd, 2010

It seems to have been a week of frantic completion and refactoring.

The first half was spent frantically converting HTML pages into PDFs using Verypdf’s HTMLtools server product. All in all the manual is very helpful and the test server could be set up quickly. It might have helped the other end if I’d remembered to break the file up for printing but that turned out to be a 10 minute job to put back into production. The next task is to transfer it from the test server onto the production one but that’ll need to wait for networking to tweak it a little.

I spent some time refactoring the call recordings archive. For some reason the archiving solution that I hacked up in November decided to start failing in March after it was changed. Despite being put back to its original state it never quite got back to working as it did. I’ve been trying to tweak it on and off but never found the time to complete it. I finally just made the time on Friday afternoon to look at it properly. I’d been thinking about item based filtering after reading the first chapter of Toby Segaran’s Programming Collective Intelligence. (On the back of this, I think I’ll be buying his Beautiful Data at some point.) Although this is not really an intelligent programme as such, the techniques have shown some real promise in the hurried tests. Using a Redis datastore, the percentage of found recordings is way up. Fingers crossed for Monday morning when I can see how the scripts have run over the weekend. I also spent some time simplifying the matching algorithm so that I didn’t have to account for so many edge cases when dealing with time.

It seems that we are approaching some sort of real-time status update system at work. I’ve sort of been arguing for this for a while to remove the bottlenecks of having each system dependent on another one. One of our suppliers is sending us XML data so I’ve been playing with XPath 1.0 to extract the relevant values (XPath 2.0 apparently isn’t directly supported by PHP; there might be a way of passing the data to Java but that adds unnecessary overhead). Anyhow the core is running but I still need to fully test it and add in security.
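For a flavour of that kind of extraction, here is a minimal Python sketch using the standard library’s ElementTree, which supports a limited subset of XPath 1.0 (child paths and simple predicates). The feed layout, with order elements carrying an id attribute and a status child, is entirely made up; the supplier’s real format is not described here.

```python
import xml.etree.ElementTree as ET


def extract_statuses(xml_text: str) -> dict:
    """Map each order id to its status text.

    The element and attribute names are hypothetical.
    """
    root = ET.fromstring(xml_text)
    # "./order[@id]" selects child <order> elements that have an id attribute.
    return {
        order.get("id"): order.findtext("status")
        for order in root.findall("./order[@id]")
    }
```

Anything beyond this subset, such as XPath functions or axes, needs a fuller XPath engine than ElementTree provides.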

I’ve also been asked to design and implement a queueing system for the main internal server. I’ve run up a quick high level overview but the detail still needs to be worked on. I’m pushing it back to June so that I can clear the decks of the older projects that are still on the board.

I had a chat with Jonathan Gray, a sound guy who does far too much, about digital humanities ideas. We’ve agreed to keep closer contact with each other about the area and to encourage each other into actually doing stuff (I have half a moleskin of ideas – time for more code, less talk then). He proposed the Bibliographica idea in January and the team wrote a blog entry for the Open Knowledge Foundation blog. It is an idea that I’m looking forward to playing with and trying to embed data from. (http://bibliographica.org/)

One of the things that I’ve been thinking about though is that increasingly when we do research, we store web pages, blog entries and so on. Whilst there is a way of recording these in a footnote (http://example.org accessed on <insert date> type thing), there does not appear to be a way of building a local archive of these with the relevant metadata for later retrieval. Don’t know about anybody else but I’ve got a fair few pages dotted around my hard drive for projects and I’d like a way of storing these properly and to be able to integrate them into bibliographies or research notes. I know that there is the WARC format (Library of Congress link and the WARC tools Google code project) to play with so I need to make time to do that.

I had a mini-hack on the Open Correspondence project last Sunday intending to update a couple of pages and got a little more done than that. The database needs rebuilding but the purl reference (http://purl.org/letter) now points to the schema. It is so close that I can’t wait to actually start hacking the data. Time to do the last little bits like tidy up the parser, use the weaving history API to embed a timeline and start using JENA, ARC and Chris Gutteridge’s Graphite library which worked out of the box (although I haven’t really used it for much yet).

Goals for this week are to finish the Open Correspondence bits, update the trac instance with the various ‘todo’s, write a blog post for the Open Knowledge Foundation for Open Correspondence, do some major testing this week at work on various XML exports and imports. I should just be about caught up then. With any luck…

Weeknotes: Redis, RDF, rdflib and openletters

Saturday, May 15th, 2010

I’ve been trying to play catch up this week at work.

One of the projects that I’ve been working on is the temporary storage of information. For one reason or another, one of the workers has decided to occasionally throw a fit and not do its job properly (on top of a connection that appears to fail at odd times). What I really needed was a temporary store to save the parsed information so that if something failed, we didn’t lose everything. To that end, I’ve started looking at Redis in more detail and started using the Windows build of version 1.2.1 (available on aspninja.com) with the Rediska library. At some point I’ll sit down and compile it on my laptop under Cygwin to get the latest version.

I ended up using the PEAR version of Rediska and managed to get it up and running fairly quickly. One of the things that I needed to do was to call a new instance of the list that I was creating in each method, having split the set and get methods into two workers. The speed of Redis is fantastic and the server happily runs on the test server caching the data and allowing another worker to load into a copy of the MySQL tables that it will eventually update. I found the Rediska library really easy to use and I’ll be using it for various projects at home to do some processing rather than using MySQL all the time. Simon Willison has a post which links to a tutorial on Redis that I found extremely useful and encouraging in finding more about the server in future.

I’ve been working on the RDF exports for the open letters project which are yet to go live. The main job has been making sure that the exports validate using the RDF validator and pulling in the data. A future task is to finish tidying up the data but I’m trying to get the letter HTML template figured out. Since Python isn’t the main language that I now use (work is entirely based on PHP), I’ve been taking a look at the Open Shakespeare code and found the RDFa work that I worked on a year ago and had completely forgotten about. It would be good to get RDFa into Open Correspondence but I think that is a later task. Main thing is to complete the initial port. I managed to get www.purl.org/letter forwarding to the site but need to get a schema page up and the purl correctly referring to the right page.

One of the things that I’ve been trying to play with is RDFlib on Windows. I built it successfully on my last laptop (Windows XP, Cygwin) but for some reason version 2.4.2 would not build on Vista, even under easy_install. I’ve been trying with version 3 (which has just been released on May 13th according to the news group) and apparently the rdfextras project has a pure Python version of the SPARQL parser, which was the part failing to build. I’ll be trying that once the current work on Open Correspondence has been completed to explore what we can do with the data.

Ben O’Steen talked at the Open Knowledge conference after me and one of the things he talked about was the psutils package. I’ve found it on Cygwin and downloaded it so it would be good to have fun with that one or to find accessible Windows ports for people who don’t necessarily want to download Cygwin.

A change to the Letters project

Sunday, March 28th, 2010

During the previously blogged dinner with Ben and Rufus, we talked about the nascent work on the letters project. Both have “encouraged” me (it didn’t take too much persuasion, it must be said) to move the project to the Open Knowledge Foundation and to port it to Python with a Redis backend rather than the current PHP/MySQL set up. I hope that the move will be complete soon.