Pattern Extraction with PattaMaika

Locally

To test-run PattaMaika locally, create a folder 'corpus' in the jobimtext_pipeline-x.x.x folder and put some sentences containing hierarchical patterns into it. You can use the sentences below, as they cover all Hearst patterns:

I saw black dogs, small yellow tigers, big horses and other animals.
They liked food like pizza, pretzels and hamburgers.
He dealt with technology, including old computers and keyboards.
She visited European countries, especially France, Spain and Germany.
There were also such pets as cats, bats or cute rats.
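
As a convenience, the test corpus can also be created with a few lines of Python. This is only a sketch: the file name "hearst_test.txt" and the one-sentence-per-line layout are assumptions, not requirements stated by PattaMaika.

```python
from pathlib import Path

# The five test sentences from above, one per line.
SENTENCES = [
    "I saw black dogs, small yellow tigers, big horses and other animals.",
    "They liked food like pizza, pretzels and hamburgers.",
    "He dealt with technology, including old computers and keyboards.",
    "She visited European countries, especially France, Spain and Germany.",
    "There were also such pets as cats, bats or cute rats.",
]


def write_test_corpus(pipeline_dir, file_name="hearst_test.txt"):
    """Create <pipeline_dir>/corpus and write the test sentences into it.

    file_name is a hypothetical name chosen for this example.
    """
    corpus_dir = Path(pipeline_dir) / "corpus"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    target = corpus_dir / file_name
    target.write_text("\n".join(SENTENCES) + "\n", encoding="utf-8")
    return target
```

Calling write_test_corpus(".") from inside the jobimtext_pipeline-x.x.x folder creates the 'corpus' folder there.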

If you want to use PattaMaika on another corpus, you will have to set the "InputPath" in the descriptor (./descriptors/PattaMaikaUIMAOperations.xml).

Then you can run PattaMaika with the following command:

java -cp "lib/*" org.jobimtext.util.RunJoBimIngestionLocal descriptors/PattamaikaUIMAOperations.xml

The output will be written to the folder specified in "OutputPath" in the descriptor, by default to 'pattern_out'. If you are dealing with large corpora, you might need to set the "AppendOutput" parameter to "true"; otherwise, the output will be overwritten for each new CAS.

These should be the ISA patterns extracted from the test corpus:

pizza ISA food    1
small_yellow_tigers ISA animals    1
yellow_tigers ISA animals    1
pretzels ISA food    1
rats ISA pets    1
computers ISA technology    1
cats ISA pets    1
black_dogs ISA animals    1
dogs ISA animals    1
France ISA European_countries    1
Germany ISA countries    1
cute_rats ISA pets    1
France ISA countries    1
tigers ISA animals    1
hamburgers ISA food    1
keyboards ISA technology    1
Spain ISA countries    1
bats ISA pets    1
Germany ISA European_countries    1
horses ISA animals    1
old_computers ISA technology    1
big_horses ISA animals    1
Spain ISA European_countries    1
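
The listing suggests that a multiword term is emitted together with its shorter variants: "small yellow tigers" yields small_yellow_tigers, yellow_tigers and tigers, and "European countries" yields both European_countries and countries. The following sketch reproduces that expansion for a single hyponym/hypernym pair. It is inferred from the output above and is not taken from the Ruta script; the function names are made up for this example.

```python
def suffix_variants(phrase):
    """Return the phrase and each shorter variant obtained by dropping leading words."""
    words = phrase.split()
    return ["_".join(words[i:]) for i in range(len(words))]


def isa_pairs(hyponym, hypernym):
    """Combine every variant of the hyponym with every variant of the hypernym."""
    return [
        f"{hypo} ISA {hyper}"
        for hypo in suffix_variants(hyponym)
        for hyper in suffix_variants(hypernym)
    ]


for line in isa_pairs("small yellow tigers", "animals"):
    print(line)
# prints:
# small_yellow_tigers ISA animals
# yellow_tigers ISA animals
# tigers ISA animals
```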

In the default setting, you will also get lemmatized patterns like "Spain ISA European_country" or "horse ISA animal".

If you want to combine the patterns extracted from several files or corpora, you can use the PatternMerger. The PatternMerger sums up the counts of all patterns. It needs the path to the pattern folder as an argument. The output folder of PatternMerger is "INPUT_merged", e.g. "pattern_out_merged" in our example.

java -cp lib/org.jobimtext.pattamaika*.jar org.jobimtext.pattamaika.PatternsMerger pattern_out
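
To illustrate what summing up the counts means, here is a minimal Python sketch of the same idea. It is not the PatternMerger implementation; it assumes that every line ends with a whitespace-separated count, as in the listing above, and the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def merge_pattern_counts(pattern_dir):
    """Sum the counts of identical patterns over all files in pattern_dir."""
    totals = Counter()
    for path in sorted(Path(pattern_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                # Split off the last whitespace-separated token as the count.
                parts = line.strip().rsplit(None, 1)
                if len(parts) != 2:
                    continue  # skip empty or malformed lines
                pattern, count = parts
                totals[pattern] += int(count)
    return totals
```

If "pizza ISA food" occurs once in each of two files, the merged count for it is 2.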

Eclipse

To run PattaMaika from Eclipse, check out the following projects:

  • org.jobimtext
  • org.jobimtext.thirdparty
  • org.jobimtext.pattamaika

You will find a run configuration in org.jobimtext.pattamaika, which runs the UIMA pipeline specified in ./descriptors/PattaMaikaUIMAOperations.xml.

By default, it uses ./testcorpus_untagged/ as the input folder and ./pattern_out to store the extracted patterns. The Ruta script file is located in ./script/SyntacticPatterns.ruta. It features all Hearst patterns for English.

Hadoop

To run PattaMaika on Hadoop, use the provided Python script generatePattaMaikaHadoopScript.py, which creates the Hadoop pipeline. It only needs the dataset (the folder of the corpus on HDFS) as an argument; optionally, you can provide a queue:

python generatePattamaikaHadoopScript.py dataset [-q queue-name]

The generated shell script is stored under "dataset_pattern-extraction.sh", and you can run it directly. The script runs two operations: first the PattaMaika UIMA pipeline, and afterwards a UniqMapper/SumReducer that sums up the counts for each pattern.

In the shell script, you will find a commented line to read the patterns from HDFS and sort them by pattern frequency:

hadoop fs -text dataset_pattern/* | sort -k4nr > dataset_patterns_sorted.txt
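
The key "-k4nr" sorts numerically and in descending order on the fourth whitespace-separated field, which is the count. The same ordering can be sketched in Python, assuming each line has the form "term ISA term count". The function name is hypothetical, and the order of lines with equal counts may differ from what sort produces.

```python
def sort_by_count(lines):
    """Sort pattern lines by the count in the fourth field, highest first."""
    rows = [line for line in lines if line.strip()]
    return sorted(rows, key=lambda line: int(line.split()[3]), reverse=True)


example = [
    "cats ISA pets\t2",
    "pizza ISA food\t7",
    "horses ISA animals\t4",
]
for line in sort_by_count(example):
    print(line)
```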

You can use these patterns for Sense Labelling.
