Rethinking the idea of the “text”

May 22nd, 2009

Is a text really stable? Is it an entity? In a lecture during my final year at the University of Leicester, one of the English lecturers posed a question: what is a text? After soliciting various answers from the masses, he argued that a text is anything: email, note, manuscript and so on. So let us assume this generalised view holds. However, is the entity stable? Is it whole?

Think about the letter or email (there are enough similarities). It is an entity of text which has time/date, to, from, (sometimes) a subject, and a body. We take it as one thing.

However, think about it. The date/time is an object. To and from are (generally) separate objects with (generally) separate identifiers. The body itself can contain other identifiers such as a nickname, or a reference to an act, event, letter, party and so on. Each thing can be seen as a single object. So is a text really a collection of identifiers put together in a format?

If that view is taken, then how do we see a text? How can we approach it and what does it allow us to do with it? Instead of publishing as a platform, would it allow for the text itself as a platform or service?

Cory Doctorow on Creative Commons licensing

May 14th, 2009

Cory Doctorow has come up with a quick guide to self-serve licensing via Creative Commons which outlines the uses and advantages of the licence. The crux, apart from citation of sources, is what it allows users to do to use your data/craft/book/doohickey in innovative ways. From that both parties can learn from each other and possibly evolve new techniques or models. It is certainly a useful guide.

The changing community of publishing

May 13th, 2009

The New York Times had a piece on digital piracy of books and the contrasting views which was picked up by Slashdot. Starting out from the anti-piracy view, it does note that bestsellers are often the most pirated books which backs up Cory Doctorow’s assertion:

“I really feel like my problem isn’t piracy,…It’s obscurity.”

His own position of publishing free digital copies at the same time as the paid-for “treeware” version comes out has helped the all-important word of mouth get about his books. He has built a passionate community around his work who both download and pay for books. By acknowledging that there will be cheapskates who only download the free version, while encouraging the rest of the community to be involved in discussing and remixing his work, he saw his latest novel stay on the NY Times bestseller list for seven weeks.

There must, however, be an acknowledgement that the creator has rights to the work. Doctorow uses Creative Commons to protect his original work but to allow users certain rights to do something with the work. The Open Definitions also do this. A simple transformation of rights into open shops rather than closed ones, i.e. saying what you can do rather than what you cannot, could change publishing and how it reacts to piracy.

So perhaps publishers need to accept that there will always be a certain amount of it going on. However, they should not see piracy as open (it’s not and never will be). The challenge, I believe, for publishers is how to digitise and make available works to a community and allow the community to do things with the books and find new markets and models that way.

The transition would be rough and mistakes made but they need to happen. Publishing needs to learn the lessons of iTunes rather than seeing the digital world as Napster.

It would be great to link into publisher versions of books to create citations or from which to construct models in blogs and wikis using community licenses. It would allow for publisher works to be re-used, ideally be open but perhaps operate on micro-payments based on traffic or level of citation, and for the user to have some authentication (or not depending on publisher) of the data as coming from a reliable source.

Just a thought but the time is ripe for change and experimentation.

XML in Milton and Shakespeare

April 22nd, 2009

As part of the Open Milton project, I’ve been thinking about the place of XML in it. Over Christmas, I wrote a small XSL transform using the Bosak XML Shakespeare files. Rufus took Antony and Cleopatra and, using LaTeX (I gather), created the Open Shakespeare Antony and Cleopatra pdf.

At one level, this is yet another version of Shakespeare. True.

But think of the possibilities. A user could happily generate their own version of the play (for instance using it in a class) or create their own annotated version for that class and not have to worry too much about losing the text / book as it can be printed and shared widely. Communities of interested parties could be pointed towards a website where they could download the material either in final form or just get the XML to use it.

To some extent this is also about embracing a standard and making it common outside of academia and closed repositories. It would appear to be easier to share texts and make use of them if we know what the coding is going to be rather than have to wait for the download to complete before taking a look.

To that end, I’ve started a contribution (currently in prototype) to create a small parser so that we can start transforming text files into TEI (Text Encoding Initiative) Lite format. Granted it is at an early stage but the initial results show some promise and are encouraging (well for me at least).

As per Open Milton/Shakespeare, I’ve been using Python to do this, with the minidom package and regular expressions. The next step will be to split the script into a reader, a parser and a writer. I’ve been concentrating on drama, but prose and verse have their own vocabularies, so the parser will probably need to be split into three, each part concentrating on one form and calling methods from the writer as appropriate.
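The reader, parser and writer split described above might look roughly like the sketch below. It is not the project's actual script: the "SPEAKER: text" line convention, the function names and the sample lines are assumptions made for illustration, and the writer only emits a body fragment using the TEI elements sp, speaker and l rather than a complete TEI Lite document with a header.

```python
import re
from xml.dom.minidom import Document

# Assumed input convention (hypothetical): a speech starts with "SPEAKER: text".
SPEECH_RE = re.compile(r"^([A-Z][A-Z ]+):\s*(.+)$")

SAMPLE = (
    "CLEOPATRA: If it be love indeed, tell me how much.\n"
    "MARK ANTONY: There's beggary in the love that can be reckon'd.\n"
)


def read_lines(text):
    """Reader: yield the non-empty lines of the raw text, stripped."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


def parse_drama(lines):
    """Parser (drama only): group lines into [speaker, [lines]] entries.

    A line that does not start a new speech is treated as a continuation
    of the previous one.
    """
    speeches = []
    for line in lines:
        match = SPEECH_RE.match(line)
        if match:
            speeches.append([match.group(1).strip(), [match.group(2)]])
        elif speeches:
            speeches[-1][1].append(line)
    return speeches


def write_tei(speeches):
    """Writer: build a minidom document holding <sp>, <speaker> and <l>."""
    doc = Document()
    body = doc.createElement("body")
    doc.appendChild(body)
    for speaker, lines in speeches:
        sp = doc.createElement("sp")
        speaker_el = doc.createElement("speaker")
        speaker_el.appendChild(doc.createTextNode(speaker))
        sp.appendChild(speaker_el)
        for line in lines:
            line_el = doc.createElement("l")
            line_el.appendChild(doc.createTextNode(line))
            sp.appendChild(line_el)
        body.appendChild(sp)
    return doc


if __name__ == "__main__":
    print(write_tei(parse_drama(read_lines(SAMPLE))).toprettyxml(indent="  "))
```

Keeping the writer separate means that prose and verse parsers could reuse it and only the parsing rules would differ per form.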

Inviting Outlook users using open source systems

March 25th, 2009

I’m a happy bunny this morning with regards to calendaring. I’ve finally managed to solve why MS Outlook was ignoring the events sent with a timezone stamp. If I scheduled an event without specifying the time, then no timezone id is attached to the event, so Outlook parses it quite happily. If I did set a time, say 10:00, then the timezone id for Europe/London is attached. I followed a suggestion on this thread on the ical4j forum. Jari Oksanen’s page on iCal for Outlook suggests that Outlook does not like local timezones and asks whether Outlook can do without them, but that appears to be a very local solution. If, like us at JISCmail, you need to be able to service requests across a geographic range, then timezones are very important.

The full version of the VTIMEZONE adds in RDATEs and EXDATEs for the various exceptions, which Outlook does not appear to read, so you need to use a cut-down version to ensure that all clients read the data. (The ical4j package, which you can download from Sourceforge, contains several versions of the timezone headers, which you can either use with the package or as a basis for rolling your own.)
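As a rough illustration of a cut-down header, the sketch below builds an invitation whose VTIMEZONE for Europe/London uses only two recurrence rules and no RDATE or EXDATE lines. It is written in Python rather than with ical4j, the function name, PRODID and addresses are made up for the example, and whether a given Outlook version accepts exactly this output is an assumption that has not been tested here.

```python
from datetime import datetime

# Cut-down VTIMEZONE: summer time starts on the last Sunday in March and
# ends on the last Sunday in October. No RDATE/EXDATE exceptions are listed.
LONDON_VTIMEZONE = [
    "BEGIN:VTIMEZONE",
    "TZID:Europe/London",
    "BEGIN:DAYLIGHT",
    "TZOFFSETFROM:+0000",
    "TZOFFSETTO:+0100",
    "TZNAME:BST",
    "DTSTART:19700329T010000",
    "RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU",
    "END:DAYLIGHT",
    "BEGIN:STANDARD",
    "TZOFFSETFROM:+0100",
    "TZOFFSETTO:+0000",
    "TZNAME:GMT",
    "DTSTART:19701025T020000",
    "RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU",
    "END:STANDARD",
    "END:VTIMEZONE",
]


def _stamp(moment):
    """Format a datetime in the basic iCalendar form, e.g. 20090325T100000."""
    return moment.strftime("%Y%m%dT%H%M%S")


def build_request(uid, summary, start, end, organizer, attendee, stamp):
    """Return an iCalendar REQUEST as text with CRLF line endings.

    start and end are local Europe/London times; stamp is assumed to be UTC.
    No text escaping or line folding is done, so keep the values short.
    """
    lines = [
        "BEGIN:VCALENDAR",
        "PRODID:-//Example//Invite Sketch//EN",  # hypothetical product id
        "VERSION:2.0",
        "METHOD:REQUEST",
    ]
    lines += LONDON_VTIMEZONE
    lines += [
        "BEGIN:VEVENT",
        "UID:" + uid,
        "DTSTAMP:" + _stamp(stamp) + "Z",
        "DTSTART;TZID=Europe/London:" + _stamp(start),
        "DTEND;TZID=Europe/London:" + _stamp(end),
        "SUMMARY:" + summary,
        "ORGANIZER:mailto:" + organizer,
        "ATTENDEE;RSVP=TRUE:mailto:" + attendee,
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_request(
        "demo-1@example.org", "Team meeting",
        datetime(2009, 3, 25, 10, 0), datetime(2009, 3, 25, 11, 0),
        "organiser@example.org", "guest@example.org",
        datetime(2009, 3, 20, 9, 0),
    ))
```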

One gotcha might be that the timezones for Outlook may not be 100% reliable, and I’m assuming that this is due to the amount of fine-grained material that is missing from the VTIMEZONE header that you need to send. However, until the Outlook team provide better support for VTIMEZONEs, the occasional error may well have to be lived with.

You must set a timezone so that clients can accept the invitation, though apparently this is on the list of things that the CalDAV committee are looking at changing since it causes so many issues.

Depositing blogs - feeding repositories from blogging applications

March 19th, 2009

I’ve recently been working on a plugin for Wordpress to set up each post as RDF enabled using OAI_ORE and SWORD which I presented to the Oxon SWIG on Tuesday.

The Berlin Declaration of Open Access states the work should be free and also that it should be deposited in a repository. This seems to be about papers and articles but what about the use of blogs, wikis and even perhaps Twitter (might be a little stretch at the moment but I could see it being used)? That suggests a layer of data which could and, where practical, should be being archived in repositories as they are being used as open Laboratory notebooks with links to data.

The plug-in that I’m working on is designed to make blogs readable in RDF for the purposes of repository deposit.

At the moment, I have written a channel which lists all the blog’s posts (using ?repository=site) as well as individual posts in RDF (using ?repository=post&repository_id=postid). I’ve been using the SIOC exporter as the base model, but I’m looking at using SKOS to get the categories and tags out of Wordpress (and trying to leverage folksonomy through that). Next will be to look at the comments and trackbacks, and at using isReferencedBy to export incoming links.
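For a client such as a repository harvester, the two query forms above could be assembled as in this small sketch. The plugin itself lives in Wordpress; the helper name and the example blog address below are hypothetical, and only the repository and repository_id parameters come from the description above.

```python
from urllib.parse import urlencode


def repository_url(blog_url, post_id=None):
    """Build the RDF export URL for the whole site or for a single post."""
    if post_id is None:
        params = {"repository": "site"}
    else:
        params = {"repository": "post", "repository_id": post_id}
    return blog_url.rstrip("/") + "/?" + urlencode(params)


if __name__ == "__main__":
    print(repository_url("http://example.org/blog"))
    print(repository_url("http://example.org/blog", post_id=42))
```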

I’ve put this onto KnowledgeForge as its own project.

iCal4j and Outlook

December 17th, 2008

Nearly there with the Bedework project but just one last hurdle in terms of getting Outlook to see all the headers correctly. Squirrelsewer has one answer to the problem (basically Outlook 2007 doesn’t appear to like iCal headers very much) but I think it’s over engineered. Bedework’s core mailer takes a simpler approach so back to battle with that one.

Update: Actually there’s another step. You need to force the Mime mail header (text/calendar;method=request;charset=UTF-8) and also ensure that the Mime email message is set to alternative as well so that Outlook doesn’t get confused.
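The project itself uses Bedework's Java mailer, but the shape of the message in the update can be sketched with Python's standard email package: a multipart/alternative message with a plain-text part and a text/calendar part carrying the method and charset parameters. The function name and addresses are invented for the example, and the method value is written in upper case here to match the METHOD line inside the calendar data, which is an assumption rather than something the update specifies.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_invite_message(sender, recipient, subject, plain_text, ics_text):
    """Wrap calendar data in a multipart/alternative email message."""
    msg = MIMEMultipart("alternative")
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject

    # Plain-text fallback for clients that do not understand calendar data.
    msg.attach(MIMEText(plain_text, "plain", "utf-8"))

    # Calendar part: Content-Type text/calendar with charset and method.
    calendar_part = MIMEText(ics_text, "calendar", "utf-8")
    calendar_part.set_param("method", "REQUEST")
    msg.attach(calendar_part)
    return msg


if __name__ == "__main__":
    message = build_invite_message(
        "organiser@example.org", "guest@example.org", "Team meeting",
        "You are invited to a team meeting.",
        "BEGIN:VCALENDAR\r\nVERSION:2.0\r\nMETHOD:REQUEST\r\nEND:VCALENDAR\r\n",
    )
    print(message.as_string())
```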

Re-use, Remix, Redistribute: Opening Knowledge

December 8th, 2008

I’m going to talk to you today about opening science and some of the ways that are being used to create platforms and tools and underlying responsibilities and actions that the commons needs to take if it is to develop a truly open way of working. Technology really is a means to an end; not the end itself and I think that we need to have some sort of culture shift towards respecting the commons again.
I’d like you to imagine a field about the size of a football pitch on the edge of a village. The field is populated by sheep, contently eating and sleeping. The field is common land and is shared by everyone in the village for the purposes of grazing. The grass is a finite resource which needs to be shared as does room for the animals to sleep or wander around.
The viability of the land depends on the ability of the users to share the resource equally. Each sheep needs to eat a certain amount of grass a day and so the land needs to be used equally.  If one owner allows its animals to use more resources than it ought to have equally, then another owner may also wish to allow their animals to overgraze, thus breaking the harmony that previously existed. The common land can only exist whilst there is a sense, tense or otherwise, of collective responsibility.
How to remedy this? Clearly, if nothing is done, the situation can only develop into an ever-decreasing spiral of retaliation, from a tit-for-tat leaving the sheep out longer than necessary to putting up physical boundaries around food and water resources. What is needed is a sense of collective responsibility mediated by a sense of collective action. Each villager needs to understand that the ground is communal, and that the resources need to be shared equally for the commons to co-exist. As a corollary, there must be an understanding, either tacit or explicit, that there are rules and standards which seek to maintain the viability of the field, meting out punishment where necessary.
This problem is referred to as the Tragedy of the Commons, by people like Clay Shirky. It defines an issue which is becoming more prevalent with platforms allowing for data, knowledge and conversation to be shared and tools to allow distributed communities to form. Collective human behaviour is likely to break down in terms of protecting one’s own resources if there are no counter balances.

So, for now, I’d like to leave the sheep grazing in the pasture and turn to open knowledge.

The creation of tools such as blogs and wikis has changed the way that groups of people form and how they create collective governance and action. The technology for both has existed since the late 1990s, though they have boomed in the last five years, with Wikipedia opening the public way for wikis and services like Blogger and Wordpress allowing the ordinary user access to a publishing platform that Gutenberg can scarcely have conceived. The cloud offers storage and services, at a price, that enable groups of users to share common interests from cheeseburger-eating cats to science news. I’m not arguing that any of this is new. Clearly it is not, but it is an example of how the street re-uses technology, to paraphrase William Gibson.
What is novel is the way that this is changing the mechanisms for transferring knowledge and creating a collaborative working environment that challenges other industries in such quick fashion: publishing is the main industry affected. Yet there is also a social change that comes from what are now ubiquitous tools. There is no novelty in using a blog or wiki, but there are changing demands on the groups that use them, particularly when ideas and data are being discussed.
Open knowledge, at its most basic definition, is any knowledge (content, data or information) that you can use, re-use, share, redistribute and remix without restriction. It should be easily available, allow for modification and redistribution, not discriminate against either groups of people or fields of endeavour (such as business), and avoid technological restrictions.
So what does that mean?
At one level, it develops and builds on the premise of open access. The Berlin Declaration of Open Access (http://oa.mpg.de/openaccess-berlin/berlindeclaration.html) defines two conditions for the state to be achieved:
1) Firstly, that access to the work (defined as the published article, raw data) is “free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly” with full attribution.
2) Secondly, that the work is deposited in an online repository which achieves the standards for inter-operability, archiving and distribution.

We can see that attribution and standards are applied as a right in this instance.
But there is an issue here in my mind which the Declaration does not fully support. It assumes that all work will be stored in a repository. Even though it can be made freely available, supported by attribution of the original article, it is predicated on the published work being made available.
What about wikis, blogs and other forms of publication? The arXiv site is open access for pre-prints and articles and is moderated so that clear rubbish is not stored on the site. It has a collective governance which maintains its status as a premier Physics site, which PLoS also does for biology and chemistry. Even here though, the emphasis is on papers, not the working notes, ideas and data; it is concerned with near-finished pieces.
Tools such as Twitter and Friendfeed allow for instant messaging to an immediate group which can either digest or pass on the details. Wikis allow for the creation of a user-editable work page, and communal or single-user blogs can also be used to share information.
Whilst they have created a useful platform to share information, two issues regarding these pop into mind.
Firstly, these tools are not designed for academics and researchers. Whilst immediately useful, these are essentially text publishing tools rather than tools capable of expressing mathematical or chemical formulae. The Open Knowledge Definition says that data should be open across technological barriers, yet one of these barriers is the platforms which are being used to create the platform for sharing. These have been designed not for academics but by media and entertainment to promote what the large majority of the populace use the Internet for: to share their own lives and passions.
There are patches to retrofit MathML into tools such as Mediawiki and Moin wiki tools and patches for blogging tools like Wordpress and Movable Type. In the end they are patches and not natively supported. The Kinasepro blog commented in 2006:

We don’t need people telling us to tag our blog posts, we need an integrated solution. We need something that can draw structures and present them attractively in an index friendly HTML format.

( http://kinasepro.wordpress.com/2006/12/05/monday-night-ot-2/)
Certainly tagging (whether hierarchical or folksonomical) is useful but the data can be shared in a fashion limited by the barriers of today.
Yet even if these plug-ins or platforms are developed and made standard, search engines are still better at archiving and presenting text than images of formulae, chemical structures and the like. Image recognition is getting better but still leaves something to be desired. The semantic web offers opportunities to link like-minded repositories together to aid rapid discoverability of texts and articles on a subject. But it draws us back to the institutional silos which must link together to aid knowledge transfer.
That still leaves the issues with the Web and linking information derived through Instant Messaging, the blogosphere and wikis which remain in the Cloud. At the moment, there is a technological restriction, and part of that comes down to the data portability groups and conversations being publicly driven by the media and entertainment industries.
Secondly, there are the problems with archiving this changing data. The Berlin Declaration’s second point is that texts and data need to be inserted into archives which are capable of producing the correct metadata for researchers to use. Arguably repositories need to find a way of archiving RSS feeds so that the data is maintained even if the service is taken down, and also a way of storing the change pages as well as the article for future researchers. This changing data needs to be stored with its changes for future reference, so that the conversation and the methodology used can be evaluated.

A second approach, more fundamental to creating a shared pasture of knowledge, comes from Open Notebook, which Cameron Neylon and Jean-Claude Bradley have been using to realise a different, open way of working. Instead of waiting until the research is fully realised, data, lab notes and workings are posted to the Internet via blogs and wikis linking to raw data files.
Clearly this raises issues:
1)The ability of a researcher to fraudulently claim data as their own and not attribute it in a correct fashion.
2)The ability to publish the data in an article at a later date for citation.
Publication is an issue which partially needs researchers and scientists to press for action, either via persuasion and demonstration that openness of data does not equal poor quality science and research or, where practical, by not publishing in those journals which avoid openness, whether through notebooks or access, or which demand onerous copyright restrictions. Of course, one cannot avoid every journal which does this. It is not practical from a career perspective, but a developing community can challenge this and make long-term changes to the use of citation and publication, rather than the research, as a benchmark of quality. Publishers need to make the separation in their own minds between the raw data that is on display and the methods which derived it, and the analysis and synthesis that the published (and, we hope, open access) paper gives.
Not every bit of data can be opened up. If a researcher is using patient data, then it needs to be anonymised so that confidentiality is maintained. Similarly with research derived from animal testing where this may lead to difficulties and danger for the researcher, institute or affiliated entities, such as family.

Sharing knowledge and ideas is a natural state for the commons. A recent example is the Galaxy Zoo project, which uses an open community to catalogue masses in the sky far more efficiently than computers can. Each user is given access to the new data and then decides whether an object is a cloud or a spiral. The catalogue is then checked and agreed by professional astronomers. Out of this came the discovery of Hanny’s Voorwerp. One of the amateur astronomers noticed an odd-shaped mass in a cloud, which came to be called Hanny’s Voorwerp (voorwerp being Dutch for object). The anomaly was confirmed by the professionals; it might well have been missed, or taken years to discover, without the large community. And the standards accepted by the community meant that the mass was named after the person who found it.
Also the data, derived from the Sloan Digital Sky Survey, is available to the community in its entirety for the project. This ease of access to the original data for various astronomy projects allows communities to form, to understand and question the data, and to develop applications with it. There may be agreements that once the data is used, it is assigned a new version number or name, with full attribution given to the original source.
Each community needs to establish a set of values to promote co-operation and also to create an environment to share ideas and methodology inside. Ideally a group should be a safe place to share ideas for later use, re-use or discussion. However there is a very real danger that these may be stolen by individuals or sub-groups and developed independently with no attribution returned.
Two things arise from this.
The group needs to create its own standards and values in terms of a discussion to create a value in itself as an entity. There is little that one can do to an individual if they go through with falsely attributing an idea to themselves, apart from community disapproval and possibly banishment. (Though this also creates issues in itself – can the group really take criticism or dissent?) A non-scientific version can clearly be seen in the recent Brand-Ross affair on the BBC where the BBC’s own standards now have another question mark against them which Cameron Neylon discusses on his blog,  Science in the Open (www.openwetware.org/scienceintheopen).
Platforms allow for expanded and more distributed forms of sharing, in turn allowing for the creation of looser groups which rely on data and knowledge rather than on being in an institution. Pareto’s Law suggests that the greatest share of ideas and contributions comes from a few members. Roughly stated, 20% of contributors come up with 80% of the work which can be shared. Now the platforms which allow the sharing do not in themselves actually define it. The users do that through tagging, blogging, terminology, and keywords to enable search engines. Science adds to this through the underlying presentation of data and data sets with the analysis and working notes.

As the group comes together, there are usually two groups – a core group and the external members – depending on the amount of interest invested in the project. Out of this, one would argue that a group would begin to create its own community of practice which would inform itself of its norms.
Internally groups need to work together to create a culture that encourages the sharing of ideas, practices and data sets with each other and with external researchers and these are different for each group.
Sharing working practices and ideas does require a sense of trust between members. A user should be able to post an idea or thought without worry that it will be open to be stolen. At the very least, they should be able to use them with attribution. There is no way of enforcing this outside of adjusting or extending accepted practices. As groups get larger, it becomes more difficult to co-ordinate the members and each member has to be able to trust the others that they can post ideas or notes of thoughts towards creating or altering algorithms. Of course, one cannot expect that every member or reader of the information will assign the attribution correctly.
What can we do? Little really. Censure may or may not work, but encouraging the notion of sharing and a cultural change may go towards changing the current user mindset regarding this and encourage a better practice of being clear about sources. A group, internally, may well know where the idea came from if it was claimed as something else and may raise this in conversation, challenging the user to retroactively give attribution, or the group can look unfavourably on, and question, future contributions, explicitly or not. This amounts to using peer review at the informal level of the group rather than at the end of the academic journey at the paper stage. (Update: Cameron Neylon has thoughts towards this posted on his blog.)
Licensing can be used as an aid: the Open Knowledge or Open Service Definitions, as well as Open Access or Creative Commons licensing, can be used to encourage sharing. The Open Definitions extend the sharing options allowed by the Creative Commons Share Alike license. An issue that will be well known in the scientific field is the nature of commercial funding. Use of the Creative Commons Non-Commercial license discriminates against the commercial world using the data. Publicly funded data then cannot be fully shared, even though it ought to be available.
What is now a classic example of open data being used to advantage comes from the world of gold mining. Goldcorp mining decided to open up their secretive data for the region around Ontario with a reward for anybody who could find a substantial new seam. Out of the service they were given over a hundred new targets which yielded new seams, some of which they had not even considered.
Out of this openness something surprising may appear as well, which will be familiar to open source programmers. Since working practices are open and fully documented, then they could enforce stricter standards on the user’s methodology. Patches for the kernel are not accepted from newbies or programmers without a reputation for quality work and the same applies to science and other sets of research. The peer review process starts earlier and is involved in the development of the paper and research and should, perhaps, be integral to the group in its collective dealings.
The Commons exists today due to a changed platform which encourages it and in some cases even depends on it (think Wikipedia). It has become engrained into culture and so is more a useful way to create and enhance the research and data processes. The underlying culture of the Internet has moved away from Read Only and has rediscovered Read/Write. This shift can serve to add value to the original material source through the insights or queries of others.
In a halcyon world, each person can graze in peace on the Commons and can help to define what group standards and norms apply to maintain it. It’s an ongoing process which each group works out for itself, but using open standards at an earlier stage than publication can enhance the science life cycle. The Open Knowledge and Open Service definitions between them seek to allow scrutiny at most levels (barring personal data) across human and machine interfaces. Each user has their own need and their own questions on a data set, and where they combine, this creates the basis of a Commons; but there should not be discrimination against the user wanting to create their own query from it, however daft or mind-bogglingly clever. Superficially this may be chaotic, but a well-formed group will be able to use this to their advantage to deliver better science and research using the principles of open access, re-use, redistribution, restriction, and licensing that allows the end user the same freedom with the work without discrimination.

Privacy in group situations

December 8th, 2008

Clay Shirky, who is currently guest blogging on BoingBoing, has a link to a fantastic article by James Grimmelmann on “Facebook and the Social Dynamics of Privacy” which I’ve perused. I’ve been thinking about the nature of groups and how one keeps information and memberships from being inappropriately shared in uses such as scheduling events and it scares me as to how we are driven by what Facebook, MySpace, Google et alia want from us and how they define these relationships for their advertising/mining benefit.

Changing ways of learning

December 2nd, 2008

Wikinomics author Don Tapscott has an intriguing argument, reported in the Times this morning,  that

“Teachers are no longer the fountain of knowledge; the internet is … Kids should learn about history to understand the world and why things are the way they are. But they don’t need to know all the dates. It is enough that they know about the Battle of Hastings, without having to memorise that it was in 1066. They can look that up and position it in history with a click on Google”

Perhaps rote learning is a thing of the past and that application of knowledge is more beneficial.

I sort of wonder if this is really the balance between the “old education” and “new education”, in that learning from the top down, i.e. via a teacher / classroom situation, is becoming redundant as new knowledge stores come online. However, I don’t think (and I doubt Tapscott believes) that the classroom is becoming redundant. What we need to do is to educate users how to actually use the knowledge stores effectively for valid data and information retrieval, such as checking a source in at least two places, using a variety of sources, and still understanding the underlying concepts (such as the maths needed for chemistry and physics, or appreciation of the text in literature or history). It is not enough to search; we need to be critical about what we are being fed.

Times are a changin’ and I think they look exciting. Reactionary voices calling for increased rote learning and bemoaning failing educational standards are, in part I believe, missing this point and not engaging with these changes. What is important is that we do not let the classroom down whilst these changes are occurring.