Adding linguistic interfaces to Open Correspondence

I’ve been playing around with the Python NLTK package, in particular its WordNet interface (WordNet itself is hosted by Princeton University). I mentioned previously that I was going to look at this, along with the idea of allowing a search for the lemmas of a word. It came about from a question posed on the Open Literature mailing list about whether you could search it with lemmas.

Xapian does word stemming, but stemming and lemmatisation are slightly different. A stemmer reduces the word production to a truncated form like produc*, since produc is the common base of the word family. That, however, is not a real word. The actual base form is produce, which is what the WordNet lemma returns.

Using the API notes, I’ve been playing around with the following:

from nltk.corpus import wordnet as wn

author = "production"  # the search term supplied by the caller

word_lem = set()
for synset in wn.synsets(author):
    for lemma in synset.lemmas():  # lemmas() and name() are methods in NLTK 3+
        word_lem.add(lemma.name())

ret_lem = list(word_lem)

Having used a set to remove any duplicates, I can return the list of the lemmas that WordNet gives. You have to go through a Synset if you don’t know the exact part of speech of a word (verb, adverb, adjective or noun), since the Lemma constructor requires the part of speech to produce the lemma. That’s fine, and I can return the names using lemma.name(), but the part of speech lives in the synset and I’m not sure how to retrieve it. It would be useful to send back, so that a user can see the part of speech and decide whether the term is of interest or not.

In the first instance, though, I can return the related synsets to the user through an API (yet to be written) and link them to the Xapian search so that they can search for a term if it is of interest. This begins opening up the letters as a linguistic dataset, since the tone and language may vary across the letters depending on the correspondent: one would expect letters to his family to be less formal than those to a business colleague or fellow author. I’m aiming to have an early draft up shortly, with some improved XML and JSON handling for the individual letters.

Given that I really did not do that well in the linguistics module at the University of Leicester, I’m surprised that this is the first API module being developed. It makes sense, though, but I need to find a way of getting back to the original purpose of the site.