Re-use, Remix, Redistribute: Opening Knowledge

I’m going to talk to you today about opening science: some of the ways that platforms and tools are being used to do it, and the underlying responsibilities and actions that the commons needs to take on if it is to develop a truly open way of working. Technology really is a means to an end, not the end itself, and I think that we need some sort of culture shift towards respecting the commons again.
I’d like you to imagine a field about the size of a football pitch on the edge of a village. The field is populated by sheep, contentedly eating and sleeping. The field is common land and is shared by everyone in the village for the purposes of grazing. The grass is a finite resource which needs to be shared, as does room for the animals to sleep or wander around.
The viability of the land depends on the ability of the users to share the resource equally. Each sheep needs to eat a certain amount of grass a day and so the land needs to be used equally. If one owner allows their animals to use more than their fair share, then another owner may also wish to allow their animals to overgraze, thus breaking the harmony that previously existed. The common land can only exist whilst there is a sense, tense or otherwise, of collective responsibility.
How to remedy this? Clearly, if nothing is done, the situation can only develop into a downward spiral of retaliation, from tit-for-tat leaving of the sheep out longer than necessary to putting up physical boundaries around food and water resources. What is needed is a sense of collective responsibility mediated by collective action. Each villager needs to understand that the ground is communal and that the resources need to be shared equally for the commons to survive. As a corollary, there must be an understanding, either tacit or explicit, that there are rules and standards which seek to maintain the viability of the field, meting out punishment where necessary.
This problem is referred to as the Tragedy of the Commons by people like Clay Shirky. It describes an issue which is becoming more prevalent as platforms allow data, knowledge and conversation to be shared and tools allow distributed communities to form. If there are no counterbalances, collective behaviour is likely to break down as each person protects their own resources.

So, for now, I’d like to leave the sheep grazing in the pasture and turn to open knowledge.

The creation of tools such as blogs and wikis has changed the way that groups of people form and how they create collective governance and action. The technology for both has existed since the late 1990s, though they have boomed in the last five years, with Wikipedia opening the way for wikis in the public eye and services like Blogger and WordPress giving the ordinary user access to a publishing platform that Gutenberg could scarcely have conceived. The cloud offers storage and services, at a price, that enable groups of users to share common interests, from cheeseburger-eating cats to science news. I’m not arguing that any of this is new. Clearly it is not, but it is an example of how the street re-uses technology, to paraphrase William Gibson.
What is novel is the way that this is changing the mechanisms for transferring knowledge and creating a collaborative working environment, one that challenges other industries remarkably quickly; publishing is the main industry affected. Yet there is also a social change that comes from what are now ubiquitous tools. There is no novelty in using a blog or wiki, but there are changing demands on the groups that use them, particularly when ideas and data are being discussed.
Open knowledge, at its most basic definition, is any knowledge (content, data or information) that you can use, re-use, share, redistribute and remix without restriction. It should be easily available, allow for modification and redistribution, not discriminate against either groups of people or fields of endeavour (such as business), and avoid technological restrictions.
So what does that mean?
At one level, it develops and builds on the premise of open access. The Berlin Declaration on Open Access (http://oa.mpg.de/openaccess-berlin/berlindeclaration.html) defines two conditions for that state to be achieved:
1) Firstly, that access to the work (defined as the published article and raw data) is “free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly” with full attribution.
2) Secondly, that the work is deposited in an online repository which meets the standards for interoperability, archiving and distribution.

We can see that attribution and standards are applied as a right in this instance.
But there is an issue here, to my mind, which the Declaration does not fully address. It assumes that all work will be stored in a repository. Even though the work can be made freely available, supported by attribution of the original article, this is predicated on the published work being made available.
What about wikis, blogs and other forms of publication? The arXiv site is open access for pre-prints and articles and is moderated so that clear rubbish is not stored on the site. It has a collective governance which maintains its status as a premier physics site, a role PLoS also plays for biology and medicine. Even here, though, the emphasis is on papers, not the working notes, ideas and data; it is concerned with near-finished pieces.
Tools such as Twitter and FriendFeed allow for instant messaging to an immediate group which can either digest or pass on the details. Wikis allow for the creation of a user-editable work page, and communal or single-user blogs can also be used to share information.
Whilst they have created a useful platform to share information, two issues regarding these pop into mind.
Firstly, these tools are not designed for academics and researchers. Whilst immediately useful, they are essentially text publishing tools and are not capable of expressing mathematical or chemical formulae. The Open Knowledge Definition says that data should be open across technological barriers, yet one of these barriers is the very platforms which are being used for sharing. They have been designed not for academics but by the media and entertainment industries, to promote what the large majority of the populace use the Internet for: to share their own lives and passions.
There are patches to retrofit MathML into wiki tools such as MediaWiki and MoinMoin, and patches for blogging tools like WordPress and Movable Type. In the end they are patches, and the formulae are not natively supported. The Kinasepro blog commented in 2006:

We don’t need people telling us to tag our blog posts, we need an integrated solution. We need something that can draw structures and present them attractively in an index friendly HTML format.

(http://kinasepro.wordpress.com/2006/12/05/monday-night-ot-2/)
Certainly tagging (whether hierarchical or folksonomical) is useful, but the data can still only be shared in a fashion limited by the barriers of today.
Yet even if these plug-ins or platforms are developed and made standard, search engines are still better at archiving and presenting text than images of formulae, chemical structures and the like. Image recognition is getting better but still leaves something to be desired. The semantic web offers opportunities to link like-minded repositories together to aid rapid discoverability of texts and articles on a subject. But it draws us back to the institutional silos which must link together to aid knowledge transfer.
That still leaves the issue of linking the Web with information derived through instant messaging, the blogosphere and wikis, which remains in the cloud. At the moment there is a technological restriction, and part of that comes down to the data portability groups and conversations being publicly driven by the media and entertainment industries.
Secondly, there are the problems with archiving this changing data. The Berlin Declaration’s second point is that texts and data need to be inserted into archives which are capable of producing the correct metadata for researchers to use. Arguably, repositories need to find a way of archiving RSS feeds so that the data is maintained even if the service is taken down, and also a way of storing the change pages as well as the article. This changing data needs to be stored with its changes so that future researchers can evaluate the conversation and the methodology used.
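As a rough illustration of what storing the changes alongside the article might look like, here is a minimal Python sketch. It is not taken from any existing repository software: the function names, the archive layout and the two sample feeds are all hypothetical, and it assumes a plain RSS 2.0 feed whose items carry a guid or link. Each time a feed is fetched, any item whose text has changed gets a new revision appended, and nothing is overwritten.

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_items(rss_xml):
    """Return {item key: text} for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = {}
    for item in root.iter("item"):
        # Prefer the guid, then the link, then the title as a stable key.
        key = item.findtext("guid") or item.findtext("link") or item.findtext("title")
        if not key:
            continue  # nothing to identify this item by, so skip it
        body = (item.findtext("title") or "") + "\n" + (item.findtext("description") or "")
        items[key] = body
    return items


def archive_snapshot(archive, rss_xml, fetched_at=None):
    """Append a revision for each item that is new or whose text has changed.

    `archive` maps item key -> list of revisions, oldest first. Earlier
    revisions are never overwritten, so the change history is preserved.
    """
    fetched_at = fetched_at or datetime.now(timezone.utc).isoformat()
    for key, body in parse_items(rss_xml).items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        revisions = archive.setdefault(key, [])
        if not revisions or revisions[-1]["sha256"] != digest:
            revisions.append({"fetched_at": fetched_at, "sha256": digest, "body": body})
    return archive


# Hypothetical sample feeds: the same notebook fetched on two different days.
FEED_V1 = """<rss version="2.0"><channel><title>Lab notebook</title>
<item><guid>exp-1</guid><title>Experiment 1</title><description>Yield 40%</description></item>
</channel></rss>"""

FEED_V2 = """<rss version="2.0"><channel><title>Lab notebook</title>
<item><guid>exp-1</guid><title>Experiment 1</title><description>Yield 42% (corrected)</description></item>
<item><guid>exp-2</guid><title>Experiment 2</title><description>Repeat run</description></item>
</channel></rss>"""

archive = {}
archive_snapshot(archive, FEED_V1, fetched_at="2008-11-01T09:00:00Z")
archive_snapshot(archive, FEED_V2, fetched_at="2008-11-02T09:00:00Z")
for key, revisions in archive.items():
    print(key, [r["fetched_at"] for r in revisions])
```

A real archive would also need to keep the feed’s own metadata and store the revisions somewhere more durable than a dictionary in memory; the point here is only that the corrected entry sits next to the original rather than replacing it.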

A second approach, more fundamental to creating a shared pasture of knowledge, comes from Open Notebook Science, which Cameron Neylon and Jean-Claude Bradley have been using to realise a different, open way of working. Instead of waiting until the research is fully realised, data, lab notes and workings are posted to the Internet via blogs and wikis linking to raw data files.
Clearly this raises issues:
1) The ability of a researcher to fraudulently claim data as their own and not attribute it in a correct fashion.
2) The ability to publish the data in an article at a later date for citation.
Publication is an issue on which researchers and scientists need to press for action. That can be done by persuasion and by demonstrating that openness of data does not equal poor-quality science and research or, where practical, by not publishing in those journals which avoid openness, whether of notebooks or of access, or which demand onerous copyright restrictions. Of course, one cannot avoid every journal which does this; it is not practical from a career perspective. But a developing community can challenge this and make long-term changes to the use of citation and publication, rather than the research itself, as a benchmark of quality. Publishers need to separate in their own minds the raw data that is on display, and the methods which derived it, from the analysis and synthesis that the published (and, we hope, open access) paper gives.
Not every bit of data can be opened up. If a researcher is using patient data, then it needs to be anonymised so that confidentiality is maintained. The same caution applies to research derived from animal testing, where openness may lead to difficulties and danger for the researcher, the institute or affiliated people, such as family.

Sharing knowledge and ideas is a natural state for the commons. A recent example is the Galaxy Zoo project, which uses an open community to catalogue masses in the sky far more efficiently than computers can. Each user is given access to the new data and then decides whether an object is a cloud or a spiral. The catalogue is then checked and agreed by professional astronomers. Out of this came the discovery of Hanny’s Voorwerp. One of the amateur astronomers noticed an odd-shaped mass in a cloud and described it as a voorwerp, Dutch for object. The anomaly was confirmed by the professionals, yet it might well have been missed, or taken years to discover, without the large community. The standards accepted by the community meant that the mass was named after the amateur who found it.
The data, derived from the Sloan Digital Sky Survey, is also available to the community in its entirety for the project. This ease of access to the original data for various astronomy projects allows communities to form to understand and question the data and to develop applications with it. There may be agreements that once the data is used, it is assigned a new version number or name, with full attribution given to the original source.
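One way to picture such an agreement is a record that carries its own provenance. The sketch below is purely illustrative: the class, its fields and the dataset names are hypothetical and do not describe how the Sloan survey or Galaxy Zoo actually label their releases. Each derived data set gets either a bumped version number or a new name, and always keeps a pointer back to what it was derived from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    """A data set release that remembers what it was derived from."""
    name: str
    version: int
    author: str
    derived_from: Optional["DatasetVersion"] = None

    def derive(self, author: str, name: Optional[str] = None) -> "DatasetVersion":
        """Create a derived release: same name with the next version number,
        or a new name starting again at version 1."""
        if name is None:
            return DatasetVersion(self.name, self.version + 1, author, self)
        return DatasetVersion(name, 1, author, self)

    def attribution_chain(self) -> List[str]:
        """List this release and every ancestor, newest first."""
        chain = []
        node: Optional[DatasetVersion] = self
        while node is not None:
            chain.append(f"{node.name} v{node.version} ({node.author})")
            node = node.derived_from
        return chain


# Hypothetical names, for illustration only.
survey = DatasetVersion("sky-survey-images", 1, "Survey team")
cleaned = survey.derive("Survey team")
classified = cleaned.derive("Volunteer community", name="classified-galaxies")

for line in classified.attribution_chain():
    print(line)
```

Because the chain is part of the record, anyone re-using the classified set can see, and credit, both the volunteers and the original survey.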
Each community needs to establish a set of values to promote co-operation and also to create an environment to share ideas and methodology inside. Ideally a group should be a safe place to share ideas for later use, re-use or discussion. However there is a very real danger that these may be stolen by individuals or sub-groups and developed independently with no attribution returned.
Two things arise from this.
The group needs to create its own standards and values through discussion, and in doing so create a value in itself as an entity. There is little that one can do to an individual who falsely attributes an idea to themselves, apart from community disapproval and possibly banishment. (Though this also creates issues in itself: can the group really take criticism or dissent?) A non-scientific version can clearly be seen in the recent Brand-Ross affair on the BBC, where the BBC’s own standards now have another question mark against them, which Cameron Neylon discusses on his blog, Science in the Open (www.openwetware.org/scienceintheopen).
Platforms allow for expanded and more distributed forms of sharing, in turn allowing for the creation of looser groups which rely on data and knowledge rather than on being in an institution. Pareto’s Law suggests that the greatest share of ideas and contributions comes from a few members. Roughly stated, 20% of contributors come up with 80% of the work which can be shared. Now the platforms which allow the sharing do not in themselves actually define it. The users do that through tagging, blogging, terminology and keywords that enable search engines to find the material. Science adds to this through the underlying presentation of data and data sets with the analysis and working notes.
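The 20/80 figure is a rule of thumb rather than a guarantee, and it is easy to check against a real group’s contribution log. The following Python sketch shows the arithmetic; the function name and the contribution counts are made up for illustration.

```python
def top_share(contributions, top_fraction=0.2):
    """Fraction of all contributions made by the most active `top_fraction`
    of contributors. `contributions` maps contributor -> number of items."""
    counts = sorted(contributions.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always count at least one contributor, even in a very small group.
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / total


# Made-up example: ten contributors and 100 contributions in total.
EDITS = {
    "a": 50, "b": 30, "c": 5, "d": 4, "e": 3,
    "f": 3, "g": 2, "h": 1, "i": 1, "j": 1,
}

share = top_share(EDITS)
print(f"Top 20% of contributors produced {share:.0%} of the work")
```

In this invented group the two most active members account for 80 of the 100 contributions; a real community may be more or less concentrated, which is worth knowing before deciding how much its norms depend on the core group.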

As the group comes together, there are usually two groups, a core group and the external members, depending on the amount of interest invested in the project. Out of this, one would argue, a group begins to create its own community of practice, which in turn informs its norms.
Internally groups need to work together to create a culture that encourages the sharing of ideas, practices and data sets with each other and with external researchers and these are different for each group.
Sharing working practices and ideas does require a sense of trust between members. A user should be able to post an idea or thought without worrying that it will be stolen. At the very least, others should only use such ideas with attribution. There is no way of enforcing this other than by adjusting or extending accepted practices. As groups get larger, it becomes more difficult to co-ordinate the members, and each member has to be able to trust the others before they can post ideas or notes of thoughts towards creating or altering algorithms. Of course, one cannot expect that every member or reader of the information will assign the attribution correctly.
What can we do? Little, really. Censure may or may not work, but encouraging the notion of sharing and a cultural change may go some way towards changing the current user mindset and encourage a better practice of being clear about sources. A group may well know internally where an idea came from if it is claimed as something else, and may raise this in conversation, challenging the user to give attribution retroactively, or the group can look unfavourably on, and question, that user’s future contributions, explicitly or not. This applies peer review at the informal level of the group rather than at the end of the academic journey, at the paper stage. (Update: Cameron Neylon has thoughts towards this posted on his blog.)
Licensing can be used as an aid. The Open Knowledge or Open Service Definitions, as well as Open Access or Creative Commons licensing, can be used to encourage sharing. The Open Definitions extend the sharing options allowed by the Creative Commons Share Alike license. An issue that will be well known in the scientific field is the nature of commercial funding. Use of the Creative Commons Non-Commercial license discriminates against the commercial world using the data. Publicly funded data then cannot be fully shared, even though it ought to be available.
What is now a classic example of open data being used to advantage comes from the world of gold mining. Goldcorp decided to open up its secret data for a region in Ontario, with a reward for anybody who could find a substantial new seam. Out of this it was given over a hundred new targets which yielded new seams, some of which it had not even considered.
Out of this openness something surprising may appear as well, which will be familiar to open source programmers. Since working practices are open and fully documented, the community can enforce stricter standards on a user’s methodology. Patches for the kernel are not accepted from newbies or programmers without a reputation for quality work, and the same can apply to science and other kinds of research. The peer review process starts earlier, is involved in the development of the research and the paper, and should, perhaps, be integral to the group in its collective dealings.
The Commons exists today due to a changed platform which encourages it and in some cases even depends on it (think Wikipedia). It has become engrained in the culture and so is a more useful way to create and enhance the research and data processes. The underlying culture of the Internet has moved away from Read Only and has rediscovered Read/Write. This shift can serve to add value to the original source material through the insights or queries of others.
In a halcyon world, each person can graze in peace on the Commons and can help to define what group standards and norms apply to maintain it. It’s an ongoing process which each group works out for itself, but one which uses open standards at an earlier stage than publication to enhance the science life cycle. The Open Knowledge and Open Service definitions between them seek to allow scrutiny at most levels (barring personal data) across human and machine interfaces. Each user has their own needs and their own questions of a data set, and where these combine they create the basis of a Commons; but there should be no discrimination against the user wanting to create their own query from it, however daft or mind-bogglingly clever. Superficially this may be chaotic, but a well-formed group will be able to use it to their advantage to deliver better science and research, using the principles of open access, re-use, redistribution, freedom from technological restriction, and licensing that allows the end user the same freedom with the work without discrimination.