This is part two of the Sirocco “modernization” blog series.
In the old, SharpNLP-based version of Sirocco, we used WordNet version 2.7 to look up the base forms, aka lemmas, of words. For example, the lemma of the verb “was” is “be”. Base forms matter to us because our pattern rules for English idioms are defined in terms of them. An example of an English idiom is “rub it in”, defined in Sirocco patterns as “rub/VB it/PRP in/IN”. This pattern lets us catch variations of the phrase, e.g. “rubbed it in”, “rubbing it in”, etc.
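To make the pattern idea concrete, here is a minimal sketch of matching a “lemma/TAG” pattern against tagged tokens. It is not Sirocco's actual matcher: the Token and IdiomPattern classes are hypothetical names, and I am assuming that a pattern tag such as VB should match any tag that starts with it (VBD, VBG, and so on).

```java
import java.util.List;

/** Hypothetical token: surface text, candidate base forms, and a POS tag. */
final class Token {
    final String text;
    final List<String> baseForms;
    final String pos;

    Token(String text, List<String> baseForms, String pos) {
        this.text = text;
        this.baseForms = baseForms;
        this.pos = pos;
    }
}

/** Hypothetical idiom pattern parsed from a string such as "rub/VB it/PRP in/IN". */
final class IdiomPattern {
    private final String[][] elements; // each element is {lemma, tagPrefix}

    IdiomPattern(String pattern) {
        String[] parts = pattern.trim().split("\\s+");
        elements = new String[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            String[] pair = parts[i].split("/", 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected lemma/TAG but got: " + parts[i]);
            }
            elements[i] = pair;
        }
    }

    /** True if the pattern matches the tokens beginning at index start. */
    boolean matchesAt(List<Token> tokens, int start) {
        if (start < 0 || start + elements.length > tokens.size()) {
            return false;
        }
        for (int i = 0; i < elements.length; i++) {
            Token t = tokens.get(start + i);
            boolean lemmaOk = t.baseForms.contains(elements[i][0]);
            boolean tagOk = t.pos.startsWith(elements[i][1]); // assumption: prefix match on tags
            if (!lemmaOk || !tagOk) {
                return false;
            }
        }
        return true;
    }
}
```

With tokens for “rubbed it in” whose base forms are “rub”, “it” and “in”, the pattern above matches at index 0. Without the base-form lookup, the surface word “rubbed” would never equal “rub”.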
WordNet is a large dictionary of English words (look up any word in this simple web UI), with Java and many other libraries available to access it programmatically. Its latest version is 3.1, and I was looking for a Java API that would let me access the newest dictionaries and run in the cloud. The second requirement had a good reason. As I read about the different options, I realized that the best way to run efficiently in the cloud was to have the WordNet dictionary files packaged in a jar and loaded as resources. I could not use WordNet dictionaries deployed as databases, or as files on the file system, as this would quickly break in the auto-scaled, auto-deployed world of modern clouds.
First, I looked at OpenNLP. OpenNLP version 1.6 supports WordNet via an add-on, which in turn uses the JWNL library; specifically, the add-on provides the JWNLLemmatizer class. If you want to use it in your project, add the following dependency to your POM (but read to the end of this blog post before you do that).
<dependency>
  <groupId>net.sf.jwordnet</groupId>
  <artifactId>jwnl</artifactId>
  <version>1.3.3</version>
  <scope>compile</scope>
</dependency>
However, the JWNLLemmatizer from the 1.6 version of OpenNLP returns only a single base form per word, which is not enough: a significant portion of English words have more than one possible base form. Interestingly, OpenNLP version 1.5.3 has the JWNLDictionary class, which returns a list of base forms. I almost ended up using that class, but didn’t, for reasons I will explain below.
After getting JWNLDictionary to work locally, I realized that I just could not deploy it to my preferred cloud provider. The OpenNLP add-on assumed it could access dictionary files directly on disk, which would not work with my data processing setup. After some research I came across this Stack Overflow post, which gave me more options. I admit I did not try every one, but I tried many. The package that did the trick was Extended JWNL, or ExtJWNL for short. I used version 1.8, and here are the necessary dependency declarations for your POM:
<dependency>
  <groupId>net.sf.extjwnl</groupId>
  <artifactId>extjwnl</artifactId>
  <version>1.8.0</version>
</dependency>
<dependency>
  <groupId>net.sf.extjwnl</groupId>
  <artifactId>extjwnl-data-wn31</artifactId>
  <version>1.2</version>
</dependency>
The second dependency is for the WordNet data files, btw.
Here is what you need to do to start using it.
import java.io.InputStream;
import net.sf.extjwnl.dictionary.Dictionary;

// ... declare your Dictionary variable ...
private Dictionary wnDict = null;

// ... initialize it in the constructor ...
// The properties file is read from the classpath, not from the file system.
String propsFile = ConfigurationManager.getConfiguration().getString("WordnetPropertiesFile");
InputStream stream = getClass().getResourceAsStream(propsFile);
wnDict = Dictionary.getInstance(stream);
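One thing to watch for in the initialization above: getResourceAsStream returns null, rather than throwing, when the resource is not on the classpath, so a typo in the path only surfaces later as a confusing error. A small guard like the sketch below fails early with a clear message. The ClasspathResources helper is a hypothetical name of mine, not part of ExtJWNL or Sirocco.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

/** Hypothetical helper that fails fast when a classpath resource is missing. */
final class ClasspathResources {
    private ClasspathResources() {
    }

    /**
     * Opens a classpath resource relative to the given class.
     * A leading slash in path makes the lookup absolute within the classpath.
     */
    static InputStream open(Class<?> anchor, String path) throws FileNotFoundException {
        InputStream in = anchor.getResourceAsStream(path);
        if (in == null) {
            // getResourceAsStream signals "not found" with null, so turn that into an exception
            throw new FileNotFoundException("Resource not found on classpath: " + path);
        }
        return in;
    }
}
```

The returned stream can then be handed to Dictionary.getInstance exactly as in the snippet above.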
Then, from your methods that need to read the base forms (or lemmas) of words, you can query the Dictionary.
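import java.util.List;
import net.sf.extjwnl.data.POS;

// calling the WN dictionary: resolve the POS label, then ask for all base forms of the lowercased word
POS fullposobj = POS.getPOSForLabel(fullpos);
List<String> bf = wnDict.getMorphologicalProcessor().lookupAllBaseForms(fullposobj, lowercaseLemma);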
Here is how Sirocco does it in its BaseFormsDictionary class.
Btw, if you are wondering about the “WordnetPropertiesFile” property in the snippet above, it points to the ExtJWNL properties file. Here it is.
<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
  <version publisher="Princeton" number="3.1" language="en"/>
  <dictionary class="net.sf.extjwnl.dictionary.FileBackedDictionary">
    <param name="morphological_processor" value="net.sf.extjwnl.dictionary.morph.DefaultMorphologicalProcessor">
      <param name="operations">
        <param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/>
        <param value="net.sf.extjwnl.dictionary.morph.DetachSuffixesOperation">
          <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
          <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
          <param name="adjective" value="|er=|est=|er=e|est=e|"/>
          <param name="operations">
            <param value="net.sf.extjwnl.dictionary.morph.LookupIndexWordOperation"/>
            <param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/>
          </param>
        </param>
        <param value="net.sf.extjwnl.dictionary.morph.TokenizerOperation">
          <param name="delimiters">
            <param value=" "/>
            <param value="-"/>
          </param>
          <param name="token_operations">
            <param value="net.sf.extjwnl.dictionary.morph.LookupIndexWordOperation"/>
            <param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/>
            <param value="net.sf.extjwnl.dictionary.morph.DetachSuffixesOperation">
              <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
              <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
              <param name="adjective" value="|er=|est=|er=e|est=e|"/>
              <param name="operations">
                <param value="net.sf.extjwnl.dictionary.morph.LookupIndexWordOperation"/>
                <param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/>
              </param>
            </param>
          </param>
        </param>
      </param>
    </param>
    <param name="dictionary_element_factory" value="net.sf.extjwnl.princeton.data.PrincetonWN17FileDictionaryElementFactory"/>
    <param name="file_manager" value="net.sf.extjwnl.dictionary.file_manager.FileManagerImpl">
      <param name="check_path" value="false"/>
      <param name="file_type" value="net.sf.extjwnl.princeton.file.PrincetonResourceDictionaryFile"/>
      <param name="dictionary_path" value="/net/sf/extjwnl/data/wordnet/wn31"/>
    </param>
  </dictionary>
  <resource class="net.sf.extjwnl.princeton.PrincetonResource"/>
</jwnl_properties>
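The suffix rule strings in the properties file, such as the “verb” value, read as a list of “suffix=replacement” pairs separated by “|”. The sketch below shows how such a string can be turned into candidate base forms that are then checked against a word list. It is a simplified illustration, not ExtJWNL's DetachSuffixesOperation: the SuffixRules class is a hypothetical name, and a plain Set stands in for the real dictionary index lookup.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified reading of a rule string like "|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|". */
final class SuffixRules {
    private final List<String[]> rules = new ArrayList<>(); // each rule is {suffix, replacement}

    SuffixRules(String spec) {
        for (String rule : spec.split("\\|")) {
            if (rule.isEmpty()) {
                continue; // skip the empty piece produced by the leading "|"
            }
            int eq = rule.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Bad rule: " + rule);
            }
            rules.add(new String[] {rule.substring(0, eq), rule.substring(eq + 1)});
        }
    }

    /** Strips each matching suffix and appends its replacement, keeping rule order. */
    Set<String> candidates(String word) {
        Set<String> out = new LinkedHashSet<>();
        for (String[] rule : rules) {
            String suffix = rule[0];
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                out.add(word.substring(0, word.length() - suffix.length()) + rule[1]);
            }
        }
        return out;
    }

    /** Keeps only the candidates found in knownWords (a stand-in for the dictionary index). */
    List<String> baseForms(String word, Set<String> knownWords) {
        List<String> found = new ArrayList<>();
        for (String candidate : candidates(word)) {
            if (knownWords.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }
}
```

With the verb rules, “making” yields the candidates “make” and “mak”, and only “make” survives the word-list check. Irregular forms such as “was” cannot be derived by suffix stripping at all, which is why the configuration also includes LookupExceptionsOperation.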
ExtJWNL supports multiple dictionary storage backends (memory, database, file-backed) and uses properties files to configure them. As you can see, I used the file-backed dictionary, and all the WordNet files are located in the extjwnl-data-wn31-1.2.jar that I deploy to my processing infrastructure when I run Sirocco in the cloud.
In my next blog post I will write a bit more about the parts of OpenNLP I use to do sentiment analysis on text.