For JoBimText pipelines below 0.1.3, consider this documentation:
Sense Labelling v. 0.1.0 — 0.1.2
You can annotate the Sense Clusters (that were computed by Chinese Whispers) with ISA labels. The ISA labels come from relations between words and can be extracted with PattaMaika.
Input files
The sense cluster file looks like this:
    #WORD CID CLUSTERTERMS
    mouse 0 cat,dog,rat
    mouse 1 keyboard,joystick
The pattern file contains the patterns and their frequencies:
    mouse ISA animal 15
    cat ISA animal 10
    dog ISA animal 20
    dog ISA pet 5
    keyboard ISA product 20
    keyboard ISA input_device 2
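To make the two input formats concrete, here is a minimal Python sketch that reads them into memory. It is not part of JoBimText: the helper names `parse_clusters` and `parse_patterns` are hypothetical, and the sketch assumes that columns are separated by tabs, which the listings above do not confirm.

```python
# Hypothetical helpers for illustration only; not part of PattaMaika.
# Assumption: columns are tab-separated and the header line starts with '#'.

def parse_clusters(text, term_sep=","):
    """Return a list of (word, cluster_id, [cluster terms])."""
    clusters = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        word, cid, terms = line.split("\t")[:3]
        clusters.append((word, int(cid), [t.strip() for t in terms.split(term_sep)]))
    return clusters


def parse_patterns(text):
    """Return {hyponym: {hypernym: frequency}} for all ISA lines."""
    patterns = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        hyponym, relation, hypernym, freq = line.split("\t")
        if relation == "ISA":
            patterns.setdefault(hyponym, {})[hypernym] = int(freq)
    return patterns


cluster_text = "#WORD\tCID\tCLUSTERTERMS\nmouse\t0\tcat,dog,rat\nmouse\t1\tkeyboard,joystick\n"
pattern_text = "dog\tISA\tanimal\t20\ndog\tISA\tpet\t5\n"

clusters = parse_clusters(cluster_text)
patterns = parse_patterns(pattern_text)
```

Storing the patterns as a nested dictionary keyed by hyponym makes it cheap to look up all hypernyms of a single cluster term, which is the access pattern the labelling step needs.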
Running the SenseLabeller
To label the sense clusters, you can use the SenseLabeller class from PattaMaika:
java -cp path/to/org.jobimtext.pattamaika-*.jar org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s sense-cluster-file -o output-file [optional parameters]
The SenseLabeller adds an additional column that contains the hypernyms for the cluster terms, ordered by frequency and score. The result looks like this:
    mouse 0 cat,dog,rat animal:60, pet:5
    mouse 1 keyboard,joystick product:20, input_device:2
Parameters
Option | Description | Default value |
---|---|---|
-p | Pattern file (PattaMaika) | none, required parameter |
-s | Sense cluster file (Chinese Whispers) | none, required parameter |
-o | Output file | none, required parameter |
-n | Maximum number of labels for a cluster (top n) | 20 |
-mf | Minimum pattern frequency in the pattern file | 2 |
-mm | Minimum number of matches between cluster terms and a hypernym | 2 |
-ms | Minimum score, see Scoring method | 30 |
-sep | Separator between a word and additional information such as a POS tag, e.g. word#NN or word/NN; useful for parsed corpora, since ISA patterns are usually computed at word level | null |
-tsep | Separator between cluster terms (term separator); cluster terms are usually separated by comma or comma+space | ‘,’ (comma) |
-pf | POS filtering; can be used to label only certain parts of speech, e.g. nouns: ‘^N.*’ | null |
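The effect of -sep and -pf can be pictured with a small Python sketch. This is an illustration under assumptions, not the SenseLabeller implementation: the function names `split_pos` and `keep_term` are hypothetical, and the sketch assumes that the -pf expression is matched against the tag that follows the separator.

```python
import re

# Hypothetical illustration of -sep and -pf; the real SenseLabeller may differ.

def split_pos(term, sep="#"):
    """Split 'mouse#NN' into ('mouse', 'NN'); the tag is None without a separator."""
    word, found, pos = term.rpartition(sep)
    return (word, pos) if found else (term, None)


def keep_term(term, sep="#", pos_filter=r"^N.*"):
    """Assumption: a term passes the filter if its POS tag matches the regex."""
    _, pos = split_pos(term, sep)
    return pos is not None and re.match(pos_filter, pos) is not None


examples = ["mouse#NN", "Berlin#NNP", "run#VB", "untagged"]
kept = [t for t in examples if keep_term(t)]
```

Splitting at the last occurrence of the separator keeps words that themselves contain the separator character intact.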
Thus, to run Sense Labelling on a clustering computed from a parsed DT, you would use a command like the following. Note that it sets the separator between word and POS tag with -sep (this is not the term separator, which is set with -tsep).
java -cp "jobimtext_pipeline_0.1.3/lib/*" org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s cluster_parsed-file -o cluster_parsed_labelled -sep '#'
If you want to label only nouns (tags starting with N, which includes proper nouns and names), you can specify a regular expression with the -pf parameter:
java -cp "jobimtext_pipeline_0.1.3/lib/*" org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s cluster_parsed-file -o cluster_parsed_labelled -sep '#' -pf '^N.*'
Scoring method
To demonstrate the scoring method, let’s have a look at the first sense cluster:
mouse 0 cat,dog,rat
Even though we find a direct entry for “mouse” (“mouse ISA animal”), we cannot use this relation for labelling, since “mouse” is ambiguous and the entry does not tell us which sense it refers to. Therefore, only the cluster terms are considered.
We find the following matching patterns for the cluster terms:
    cat ISA animal 10
    dog ISA animal 20
    dog ISA pet 5
The scoring for the hypernym “pet” is straightforward: It has a count of 5 with “dog” and occurs as the hypernym for 1 cluster term, therefore the final score is: 5 * 1 = 5.
For “animal”, we find 2 matching cluster terms, “cat” and “dog”. The summed pattern count is 10 + 20 = 30, therefore the final score is: (10+20) * 2 = 60.
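The calculation above can be reproduced with a short Python sketch. It follows the description in this section (summed pattern count multiplied by the number of matching cluster terms) and is not taken from the PattaMaika source; the function name `score_cluster` is hypothetical, and the thresholds -mf, -mm and -ms are deliberately not applied.

```python
# Hypothetical re-implementation of the scoring described above, for illustration.
# Thresholds (-mf, -mm, -ms) and the top-n cutoff (-n) are not applied here.

def score_cluster(cluster_terms, patterns):
    """Return (hypernym, score) pairs sorted by descending score."""
    totals = {}   # hypernym -> summed pattern frequency over cluster terms
    matches = {}  # hypernym -> number of cluster terms with that hypernym
    for term in cluster_terms:
        for hypernym, freq in patterns.get(term, {}).items():
            totals[hypernym] = totals.get(hypernym, 0) + freq
            matches[hypernym] = matches.get(hypernym, 0) + 1
    scores = {h: totals[h] * matches[h] for h in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Pattern file from the example, as {hyponym: {hypernym: frequency}}.
patterns = {
    "mouse": {"animal": 15},
    "cat": {"animal": 10},
    "dog": {"animal": 20, "pet": 5},
    "keyboard": {"product": 20, "input_device": 2},
}

# Only the cluster terms are looked up, never the ambiguous word "mouse" itself.
labels_0 = score_cluster(["cat", "dog", "rat"], patterns)
labels_1 = score_cluster(["keyboard", "joystick"], patterns)
print(labels_0)  # [('animal', 60), ('pet', 5)]
print(labels_1)  # [('product', 20), ('input_device', 2)]
```

Multiplying by the number of matching terms rewards hypernyms that are shared across the cluster, so a label supported by several cluster terms outranks one that is frequent for a single term only.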