Finding the data signal in the noise

Marshall Kirkpatrick, on ReadWriteWeb, poses the question

A web of infinite information: does that sound like a scary problem of “just too much”?

in his post “Mamas, Don’t Let Your Babies Grow Up to Be Data Wranglers”, where he discusses an interview with Evan Williams on GigaOm. (I’m not going to discuss the interview itself here, though it is an interesting read.)

(I’m not sure I agree with the idea that the decentralised web is dead, but the links between sites and services are becoming increasingly visible, sometimes deliberately so and sometimes because they are being used as a service. However, I digress…)

In response to Om Malik’s question, “You feel there is just too much stuff on the web these days?”, Williams responds:

There’s too much stuff. It seems to me that almost all tools we rely on to manage information weren’t designed for a world of infinite info. They were designed as if you could consume whatever was out there that you were interested in.

before identifying Twitter as a response to this problem of finding the signal in the noise. This response still requires the user to identify the signal that they are interested in and to follow it. Machine algorithms may not necessarily identify precisely what a user is interested in but, from what I understand, they are getting better.

However, we come back to the question that Tim Davies posed in the aftermath of the Open Data hackday. He identifies two approaches to Open Data: data-led or problem-led. The data-led approach finds “some data of interest, and then explored what could be done with it”, while the problem-led approach starts “with an issue to explore and then seeking data to work with”. Of course, both have their issues: data-led work can lose focus, while problem-led work can struggle to identify the relevant data.

This encapsulates the issue that Evan Williams identifies with the infinite amount of data on the web: how to make it useful. The approach still seems to be very much a data-led one rather than a problem-led one (though this is not to say that problem-led approaches do not exist for apps or sites). The tools to usefully mine the existing data are only just being developed, which recalls something that I heard a while ago: the tools that we have now solve existing or previous problems, not tomorrow’s ones. They are being developed by interested parties, but somehow we need to get the data developers talking to the problem solvers to really make use of the existing data sets.

In the blurb to Philipp Janert’s Data Analysis with Open Source Tools, there is an appropriate line:

purpose is more important than process

So Marshall’s question might well be turned slightly around. Rather than looking at the notion of there being too much data, it’s a question of how to identify the purpose for mining the noise for the signal. There is only too much information if you do not have a purpose in looking at it.