Over the last couple of weeks I've been spending a little time exploring NLP concepts with GATE (General Architecture for Text Engineering). Overall it's a very useful tool, with a few drawbacks.
GATE has two main aspects. First, it provides a framework to pipe documents through layers of processing plugins. Second, it offers a few useful plugins and wrappers for existing external applications.
The processing pipeline can be either a single stack or a conditional one, the latter allowing for relatively complex 'applications'. As a document moves through the stack, 'annotations' are attached to it. In many cases, plugins further down the stack rely on annotations created by earlier plugins.
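To make the pipeline idea concrete, here is a minimal sketch in plain Python. It is not GATE's API; the Annotation and Document classes and both stages are names I made up for illustration. It only shows a document passing through a single stack, with the second stage depending on annotations left by the first.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str        # annotation type, e.g. "Token"
    start: int       # character offset where the span begins
    end: int         # character offset just past the span
    features: dict = field(default_factory=dict)

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

def tokenize(doc):
    """Hypothetical first stage: one Token annotation per whitespace-separated word."""
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        pos = start + len(word)
        doc.annotations.append(Annotation("Token", start, pos, {"string": word}))

def tag_capitalized(doc):
    """Hypothetical second stage: relies on the Token annotations made earlier."""
    for ann in [a for a in doc.annotations if a.kind == "Token"]:
        if ann.features["string"][0].isupper():
            doc.annotations.append(Annotation("Capitalized", ann.start, ann.end))

def run_pipeline(doc, stages):
    """Pass the document through each stage in order."""
    for stage in stages:
        stage(doc)
    return doc

doc = run_pipeline(Document("Dr. John Doe arrived"), [tokenize, tag_capitalized])
```

Swapping the order of the two stages would leave tag_capitalized with nothing to work on, which is the dependency between plugins described above.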
By default you can play with the ANNIE (A Nearly-New Information Extraction system) stack. It loads a very useful pipeline including plugins for tagging parts of speech, annotating keywords found in dictionary files, a basic JAPE grammar, and an Orthographic Coreferencer.
Every word in the document found in one of the dictionaries is annotated with the dictionary name it came from. This is useful if you have entities and words you want to call out like abbreviations, titles, dates, and common names.
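A rough sketch of the dictionary lookup idea, again in plain Python rather than anything GATE ships. The DICTIONARIES table and the lookup function are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
import re

# Hypothetical dictionaries; real word lists would be far larger and live in files.
DICTIONARIES = {
    "title": {"Dr.", "Mr.", "Mrs."},
    "first_name": {"John", "Jane"},
}

def lookup(text):
    """Return (start, end, dictionary_name) for every word found in a dictionary."""
    hits = []
    for match in re.finditer(r"\S+", text):
        for name, words in DICTIONARIES.items():
            if match.group() in words:
                hits.append((match.start(), match.end(), name))
    return hits

hits = lookup("Dr. John Doe met Mrs. Smith")
```

Each hit carries the name of the dictionary it came from, so later stages can tell a title from a first name without looking at the text again.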
JAPE is a basic rule language based on CPSL (Common Pattern Specification Language). With JAPE you can define matching expressions that, when fired, assign a new annotation to the range (or sub-range) matched. The patterns can match on characters in the document or on annotations created previously.
This is useful where you have a dictionary plugin that identifies common first names as a FirstName annotation, and a JAPE rule that tags "Mr.", "Mrs." and "Dr." as a Title. Later in the JAPE grammar you can match on "Title FirstName Token", where Token is an annotation representing the next 'word' (created by the part of speech and sentence splitter plugins). I'm not doing the language justice here, so don't take this as a tutorial.
Along with simply assigning an annotation on a match, arbitrary Java code can be executed on the match, allowing for more complex annotation manipulation. Sadly, this makes the grammar files hard to read and brittle. It would be good to see common functions get pushed back down into the grammar. Looking through some of the existing rules, it seems there might be an opportunity here.
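This is not JAPE syntax, but the "Title FirstName Token" match can be sketched in plain Python to show what such a rule does. I'm assuming each word carries exactly one label, which is a simplification (in practice a word can carry several annotations at once), and find_people is a name I made up.

```python
def find_people(annotated):
    """annotated: list of (kind, start, end) sorted by start, one entry per word.
    Returns (start, end) spans where Title, FirstName, Token appear in a row."""
    pattern = ["Title", "FirstName", "Token"]
    spans = []
    i = 0
    while i + len(pattern) <= len(annotated):
        window = annotated[i:i + len(pattern)]
        if [kind for kind, _, _ in window] == pattern:
            # The new annotation covers the whole matched range.
            spans.append((window[0][1], window[-1][2]))
            i += len(pattern)
        else:
            i += 1
    return spans

# "Dr. John Doe arrived" with labels assigned by hypothetical earlier stages.
words = [("Title", 0, 3), ("FirstName", 4, 8), ("Token", 9, 12), ("Token", 13, 20)]
people = find_people(words)
```

The single span returned covers "Dr. John Doe", which is the range a rule like this would annotate.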
One problem with JAPE is the clutter that builds up as the rules get more complex. This complexity stems from later rules depending on earlier rules. Early rules make temporary annotations that are used later, then must be explicitly discarded by other rules.
For example, you may have a few rules that create a TempPerson annotation, and one later rule that looks for TempPerson and Unknown annotations and tries to formulate a Person annotation. The next step is to remove the various TempPerson annotations, and the Unknown ones if they were sucked into the Person annotation.
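Here is a small sketch of that bookkeeping in plain Python (not JAPE, and not how GATE stores annotations). The promote_and_clean function and the "TempPerson followed directly by Unknown" rule are assumptions made purely for illustration.

```python
def promote_and_clean(annotations):
    """annotations: list of dicts with 'kind', 'start', 'end'.
    Merge each TempPerson with an immediately following Unknown into a Person,
    then drop every TempPerson and any Unknown that a Person absorbed."""
    ordered = sorted(annotations, key=lambda a: a["start"])
    people = []
    absorbed_ids = set()
    for first, second in zip(ordered, ordered[1:]):
        if first["kind"] == "TempPerson" and second["kind"] == "Unknown":
            people.append({"kind": "Person", "start": first["start"], "end": second["end"]})
            absorbed_ids.add(id(second))
    # The explicit clean-up step: temporaries never leave this function.
    kept = [a for a in ordered
            if a["kind"] != "TempPerson" and id(a) not in absorbed_ids]
    return kept + people

anns = [
    {"kind": "TempPerson", "start": 0, "end": 8},
    {"kind": "Unknown", "start": 9, "end": 12},
    {"kind": "Unknown", "start": 20, "end": 25},
]
result = promote_and_clean(anns)
```

Even in this toy version, more of the code is spent on clean-up than on the actual Person rule, which is the clutter I'm complaining about.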
I'm not experienced enough to offer any real suggestions, but having a rule scope or 'exports' (only allow named annotations out of the grammar file context) might help a bit.
Allowing 'imports' of rule files into a top-level grammar might help as well. For example, the Person grammar would import the Name, JobTitle, FirstName, and Gender grammars. A second Organization grammar would import its relevant bits.
Combining this with exports, annotations not explicitly exported (temporary ones) would be filtered out.
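To show what I mean by exports, here is a hypothetical sketch in plain Python. Nothing like run_grammar exists in GATE as far as I know; it is only meant to illustrate filtering a grammar's output down to the annotation types it names.

```python
def run_grammar(rules, annotations, exports):
    """Apply each rule in order, then keep only the new annotations whose kind
    is listed in exports. Annotations that existed before the grammar ran are kept."""
    working = list(annotations)
    for rule in rules:
        working.extend(rule(working))
    produced = working[len(annotations):]
    return list(annotations) + [a for a in produced if a[0] in exports]

# Hypothetical rules; annotations are (kind, start, end) tuples.
def temp_person_rule(anns):
    return [("TempPerson", s, e) for kind, s, e in anns if kind == "FirstName"]

def person_rule(anns):
    return [("Person", s, e) for kind, s, e in anns if kind == "TempPerson"]

result = run_grammar(
    [temp_person_rule, person_rule],
    [("FirstName", 4, 8)],
    exports={"Person"},
)
```

The TempPerson annotation is still available to person_rule while the grammar runs, but it never reaches the rest of the stack, so no clean-up rule is needed.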
A last comment on JAPE would be around matching inside existing annotations.
Let's say you have an annotation named Person over the string "Dr. John Doe", defined in a grammar further down the stack, and you want to layer additional meta-data on this string, like his Vocation. From what I can tell, you must add a new rule very early in the stack that both recognizes a Title and gives it the meta-data vocation == "doctor". Then one or more of the Person rules need to be modified to pull the vocation meta-data up into the new annotation. This forces the rules to get more complex, and generally requires you to switch from the JAPE syntax to using Java.
If you could match on a Person which happens to have a Title sub-annotation, you could test whether the Title denotes a vocation and add the new vocation meta-data to the Person annotation. No previous rules would need to be modified. An easier alternative might be for the Title rule to add the vocation meta-data, but allow Person to inherit this meta-data from its sub-annotations. Obviously I haven't thought this through too deeply, but there is room for improvement here.
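A minimal sketch of the inheritance alternative, in plain Python with annotations as dicts. The inherit_features function and the dict layout are my own assumptions, not something GATE or JAPE provides.

```python
def inherit_features(parent, annotations, keys):
    """Copy the named features from any annotation contained in parent's span
    onto parent, without overwriting features parent already has."""
    for child in annotations:
        if child is parent:
            continue
        inside = parent["start"] <= child["start"] and child["end"] <= parent["end"]
        if inside:
            for key in keys:
                if key in child["features"]:
                    parent["features"].setdefault(key, child["features"][key])
    return parent

# "Dr. John Doe": the Title rule already set vocation, the Person rule did not.
title = {"kind": "Title", "start": 0, "end": 3, "features": {"vocation": "doctor"}}
person = {"kind": "Person", "start": 0, "end": 12, "features": {}}
inherit_features(person, [title, person], keys=["vocation"])
```

With something like this in place, the Title rule owns the vocation meta-data and none of the Person rules have to know about it.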
Last is the Orthographic Coreferencer, which simply looks for all the possible occurrences of a Person in a document and associates the ones it believes are references to the same Person. For example, "Dr. John Doe" and "Dr. Doe" could be names referencing the same Person.
The only issue I have with the implementation is that a new Entity isn't created inheriting the properties of the individual occurrences. This Entity could easily be a resource attached to the document that references the individual annotations and offers a convenient way to pull out alternate names, meta-data, etc. In practice, a List is created with all the ids of the annotations and is assigned as a meta-data property to the annotations. This turns out to be a bit cumbersome to use.
Regardless, I recommend you check out GATE. So far, it seems like the only tool of its kind (that I know of and that is freely available). If not, please comment on others I should look at.
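Roughly what I have in mind for an Entity, sketched in plain Python. The Entity class, the build_entities function, and the "matches" key holding the list of ids are all hypothetical names for illustration, not GATE's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    annotation_ids: list
    names: list = field(default_factory=list)
    features: dict = field(default_factory=dict)

def build_entities(annotations):
    """annotations: dict of id -> {'text', 'features', 'matches'}, where 'matches'
    is the list of ids the coreferencer judged to be the same Person."""
    entities, seen = [], set()
    for ann_id in sorted(annotations):
        if ann_id in seen:
            continue
        chain = sorted(set(annotations[ann_id].get("matches", [])) | {ann_id})
        seen.update(chain)
        entity = Entity(annotation_ids=chain)
        for member in chain:
            ann = annotations[member]
            if ann["text"] not in entity.names:
                entity.names.append(ann["text"])
            # The first occurrence to define a feature wins.
            for key, value in ann["features"].items():
                entity.features.setdefault(key, value)
        entities.append(entity)
    return entities

mentions = {
    1: {"text": "Dr. John Doe", "features": {"vocation": "doctor"}, "matches": [1, 3]},
    2: {"text": "Jane Roe", "features": {}, "matches": [2]},
    3: {"text": "Dr. Doe", "features": {}, "matches": [1, 3]},
}
entities = build_entities(mentions)
```

The alternate names and the merged meta-data then live in one place instead of being spread across every occurrence.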
I'm just beginning to evaluate tools like this right now. It looks like there is another free one: http://incubator.apache.org/uima/index.html
I'd be interested to hear your thoughts on it.
Yeah, I'll give that one a look over. Thanks!