PattaMaika

Overview

The PattaMaika pattern engine extracts hierarchical patterns from a text corpus. It first runs the OpenNLP pipeline (segmenter, POS tagger, chunker) and then uses the NP chunk annotations to extract the patterns.

The pipeline components:

  • OpenNLP pipeline
    • OpenNlpSegmenter
    • OpenNlpPOSTagger
    • OpenNlpChunker
  • ChunkUpdater to remove stopwords like “other” from the beginning of the NP chunks (see the sketch after this list)
  • RutaEngineRunner to run the Ruta Script
  • SyntacticPatternsFreq, the output component which counts and outputs the patterns
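The ChunkUpdater itself is a separate UIMA component, but its effect can be illustrated with a short Ruta sketch. This is only a hedged illustration: NP stands in for the real chunk type produced by the OpenNlpChunker, the real stopword list is longer than the single word shown here, and TRIM shrinks both boundaries while the real component only strips the beginning of the chunk.

// Hypothetical Ruta sketch of the ChunkUpdater idea; the type
// names and the stopword list are stand-ins, not the real code.
DECLARE NP, FilteredNP, StopWord;
// mark the stopword (the real component uses a longer list)
W{REGEXP("other", true) -> StopWord};
// copy each NP chunk and trim stopwords off the copy's boundaries
NP{-> CREATE(FilteredNP)};
FilteredNP{-> TRIM(StopWord)};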

UIMA Ruta

UIMA Ruta is a text analysis engine that can identify patterns of UIMA annotations and words. It is flexible and powerful, as it allows for complex and nested patterns (see the UIMA Ruta documentation).

In our setting, the ChunkUpdater produces “FilteredNP” annotations, which we can simply treat as NP chunks. We import them as _NP:

IMPORT org.jobimtext.pattamaika.type.FilteredNP FROM TypeSystem AS _NP;
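The rules below also use a few helper types and variables that the full script is assumed to declare. A minimal sketch of such declarations could look like this; the feature types are an assumption inferred from how the rules use them, not copied from the original script:

// Hypothetical declarations inferred from the rules below.
DECLARE TEMP, TEMP_X;
DECLARE Annotation PATTERN (Annotation x);
DECLARE Annotation ISA (STRING x, STRING y);
STRING x;
STRING nn;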

Our implementation of Hearst patterns is based on these _NP annotations. A sentence like “She likes cats, small dogs and other animals” would give us the following _NPs: cats, small dogs, animals.
The first Hearst pattern, Y, Y and other X, matches this sentence. In Ruta script it looks like this:

(_NP (COMMA _NP)* ("and" | "or") "other" _NP{->TEMP}) {-PARTOF(PATTERN) -> CREATE(PATTERN, "x" = TEMP)};

We look for an enumeration of _NPs and create a PATTERN annotation from them. The last _NP is the hypernym: it is first marked as TEMP and then stored in the PATTERN’s x feature. We want to create a PATTERN only once, so the condition -PARTOF(PATTERN) checks that the match is not already covered by an existing PATTERN.
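For comparison, another classic Hearst pattern, X such as Y, could be written in the same style. The following rule is only a sketch and not necessarily part of the original PattaMaika rule set; since the hypernym comes first here, TEMP marks the first _NP:

// Sketch of the "X such as Y" Hearst pattern; hypothetical,
// not taken from the original PattaMaika script.
(_NP{->TEMP} "such" "as" _NP (COMMA _NP)* (("and" | "or") _NP)?)
    {-PARTOF(PATTERN) -> CREATE(PATTERN, "x" = TEMP)};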

To create ISA annotations from these PATTERN matches, we process each PATTERN in turn. The hypernym is parsed first so that tautologies (X is a X) can be excluded. For each of the remaining _NPs, we create an ISA annotation with hyponym Y and hypernym X.

BLOCK(ForEach) PATTERN{} {

    // store the covered text of the hypernym in the variable x
    PATTERN.x{PARSE(x)};
    // mark occurrences of the hypernym text as TEMP_X,
    // so that tautologies can be skipped below
    x->TEMP_X;

    // creates an ISA annotation for each desired _NP
    BLOCK(ForEach) _NP{} {
        _NP{AND(-PARTOF(TEMP_X), PARSE(nn)) -> CREATE(ISA, "y" = nn, "x" = x)};
    }
}
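For the example sentence above, this block yields two ISA annotations, ISA(y = “cats”, x = “animals”) and ISA(y = “small dogs”, x = “animals”); the hypernym itself produces no ISA(animals, animals) entry because of the -PARTOF(TEMP_X) check.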

The pattern extraction is very fast: on a medium-sized Hadoop cluster, processing the complete English Wikipedia took only about 4 hours, and summarizing the pattern counts took about 30 minutes more, so the whole operation finished in less than 5 hours.

Running PattaMaika

To extract the labels, see this documentation: Pattern Extraction with PattaMaika

To label sense clusters, follow the instructions in Sense Labelling.
