Sense Labelling

For JoBimText pipeline versions below 0.1.3, see this documentation:
Sense Labelling v. 0.1.0 — 0.1.2

You can annotate the sense clusters computed by Chinese Whispers with ISA (hypernym) labels. These labels are derived from ISA relations between words, which can be extracted with PattaMaika.

Input files

The sense cluster file looks like this:

#WORD    CID   CLUSTERTERMS
mouse    0    cat,dog,rat
mouse    1    keyboard,joystick
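The file layout above can be read with a few lines of code. The following is a minimal sketch, not the parser used by the SenseLabeller; the function name parse_sense_clusters is hypothetical, and it assumes that columns are separated by tabs or by runs of two or more spaces and that the "#WORD" line is a header.

```python
import re

def parse_sense_clusters(lines, term_sep=","):
    """Return a list of (word, cluster_id, terms) tuples."""
    clusters = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and the "#WORD ..." header carry no data.
        if not line or line.startswith("#WORD"):
            continue
        # Assumption: columns are separated by tabs or runs of 2+ spaces.
        word, cid, terms = re.split(r"\t+| {2,}", line, maxsplit=2)
        clusters.append(
            (word, int(cid), [t.strip() for t in terms.split(term_sep) if t.strip()])
        )
    return clusters

sample = [
    "#WORD    CID   CLUSTERTERMS",
    "mouse    0    cat,dog,rat",
    "mouse    1    keyboard,joystick",
]
clusters = parse_sense_clusters(sample)
for word, cid, terms in clusters:
    print(word, cid, terms)
```

Each word may appear on several lines, one per sense; the cluster id distinguishes the senses.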

The pattern file contains the patterns and their frequencies:

mouse ISA animal    15
cat ISA animal    10
dog ISA animal    20
dog ISA pet    5
keyboard ISA product    20
keyboard ISA input_device    2
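As an illustration of this format, the sketch below splits each line into hyponym, hypernym and frequency. The function name parse_patterns is hypothetical, and the code assumes that the frequency is the last whitespace-separated field and that the relation is written as "hyponym ISA hypernym", as in the sample above.

```python
def parse_patterns(lines):
    """Return a list of (hyponym, hypernym, frequency) tuples."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the frequency is the last whitespace-separated field.
        relation, count = line.rsplit(None, 1)
        # Assumption: the relation has the form "<hyponym> ISA <hypernym>".
        hyponym, hypernym = relation.split(" ISA ", 1)
        patterns.append((hyponym.strip(), hypernym.strip(), int(count)))
    return patterns

sample = [
    "mouse ISA animal    15",
    "cat ISA animal    10",
    "dog ISA animal    20",
    "dog ISA pet    5",
    "keyboard ISA product    20",
    "keyboard ISA input_device    2",
]
patterns = parse_patterns(sample)
for hyponym, hypernym, freq in patterns:
    print(hyponym, hypernym, freq)
```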

Running the SenseLabeller

To label the sense clusters, you can use the SenseLabeller class from PattaMaika:

java -cp path/to/org.jobimtext.pattamaika-*.jar org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s sense-cluster-file -o output-file [optional parameters]

The SenseLabeller appends a column with the hypernyms found for the cluster terms, ordered by frequency and score; the number after each hypernym is its score (see Scoring method). The result looks like this:

mouse    0    cat,dog,rat    animal:60, pet:5 
mouse    1    keyboard,joystick    product:20, input_device:2

Parameters

Option  Default value             Description
-p      none, required parameter  Pattern file (PattaMaika)
-s      none, required parameter  Sense cluster file (Chinese Whispers)
-o      none, required parameter  Output file
-n      20                        Maximal number of labels for a cluster (top n)
-mf     2                         Minimal pattern frequency in the pattern file
-mm     2                         Minimal number of matches between cluster terms and a hypernym
-ms     30                        Minimal score, see Scoring method
-sep    null                      Separator between word and additional information such as a POS tag, e.g. word#NN or word/NN; useful for parsed corpora, since ISA patterns are usually computed on word level
-tsep   ',' (comma)               Separator between cluster terms (term separator); cluster terms are usually separated by comma or comma+space
-pf     null                      POS filtering; restricts labelling to certain parts of speech, e.g. nouns: '^N.*'
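How the -n, -mm and -ms options interact is not spelled out here. The sketch below shows one plausible reading, assuming that all thresholds are inclusive lower bounds and that the top n labels are chosen by descending score. The function name select_labels and the candidate tuples are hypothetical; -mf is assumed to be applied earlier, when the pattern file is read, and is not shown.

```python
def select_labels(candidates, n=20, mm=2, ms=30):
    """Filter and rank label candidates.

    candidates: list of (hypernym, score, matches), where matches is the
    number of cluster terms that occur with the hypernym.
    """
    # Assumption: a candidate is kept if matches >= mm and score >= ms.
    kept = [(h, s) for h, s, m in candidates if m >= mm and s >= ms]
    # Highest score first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda hs: (-hs[1], hs[0]))
    return kept[:n]

# Candidates for the cluster "cat,dog,rat" from the example files.
candidates = [("animal", 60, 2), ("pet", 5, 1)]
strict = select_labels(candidates)                # defaults from the table
relaxed = select_labels(candidates, mm=1, ms=0)   # thresholds relaxed
print(strict)
print(relaxed)
```

Under this reading, lowering -mm and -ms admits more labels with weaker support, while -n caps the length of the label list.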

To label a clustering computed from a parsed DT, use a command like the following. Note that it sets the -sep option to '#', so that the additional information attached to each word (e.g. the POS tag in word#NN) is separated from the word itself.

java -cp "jobimtext_pipeline_0.1.3/lib/*" org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s cluster_parsed-file -o cluster_parsed_labelled -sep '#'

To label only nouns (tags starting with N, which also covers proper nouns), specify a regular expression with the -pf parameter:

java -cp "jobimtext_pipeline_0.1.3/lib/*" org.jobimtext.pattamaika.SenseLabeller -p pattern-file -s cluster_parsed-file -o cluster_parsed_labelled -sep '#' -pf '^N.*'
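The effect of -sep and -pf on a single term can be pictured as follows. This is only a sketch of the idea, not the SenseLabeller's code: the helper names split_term and passes_pos_filter are hypothetical, and it assumes that the tag is the part after the last separator and that the regular expression is matched against the tag from its start.

```python
import re

def split_term(term, sep="#"):
    """Split 'word#NN' into ('word', 'NN'); terms without sep get an empty tag."""
    word, found, tag = term.rpartition(sep)
    if not found:
        return term, ""
    return word, tag

def passes_pos_filter(term, sep="#", pos_pattern=r"^N.*"):
    """Return True if the tag of the term matches the POS pattern."""
    _, tag = split_term(term, sep)
    return re.match(pos_pattern, tag) is not None

for term in ["mouse#NN", "Berlin#NNP", "run#VB", "mouse"]:
    print(term, split_term(term), passes_pos_filter(term))
```

Stripping the tag matters because the ISA patterns are usually computed on plain words, so "mouse#NN" has to be looked up as "mouse".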

Scoring method

To demonstrate the scoring method, let’s have a look at the first sense cluster:

mouse    0    cat,dog,rat

Even though the pattern file contains a direct entry for “mouse” (“mouse ISA animal”), this relation cannot be used for labelling: the word “mouse” is ambiguous, and the pattern does not say which of its senses is meant. Therefore, only the cluster terms are considered.

We find the following matching patterns for the cluster terms:

cat ISA animal    10
dog ISA animal    20
dog ISA pet    5

The scoring for the hypernym “pet” is straightforward: It has a count of 5 with “dog” and occurs as the hypernym for 1 cluster term, therefore the final score is: 5 * 1 = 5.

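For “animal”, we find 2 matching cluster terms, “cat” and “dog”. The summed pattern count is 30, so the final score is (10+20) * 2 = 60. In general, the score of a hypernym is the sum of its pattern counts over the matching cluster terms, multiplied by the number of matching cluster terms.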
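The calculation above can be reproduced with a short sketch. It follows the description in this section (summed pattern count times number of matching cluster terms) and is not taken from the PattaMaika source; the function name score_cluster and the tuple layout of the patterns are assumptions made for illustration.

```python
from collections import defaultdict

def score_cluster(cluster_terms, patterns):
    """Score hypernyms for one cluster.

    patterns: list of (hyponym, hypernym, count) tuples.
    Returns (hypernym, score) pairs, highest score first.
    """
    terms = set(cluster_terms)
    total = defaultdict(int)    # summed pattern count per hypernym
    matched = defaultdict(set)  # cluster terms seen with each hypernym
    for hyponym, hypernym, count in patterns:
        # Only cluster terms count; the ambiguous head word is ignored.
        if hyponym in terms:
            total[hypernym] += count
            matched[hypernym].add(hyponym)
    scores = {h: total[h] * len(matched[h]) for h in total}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

patterns = [
    ("mouse", "animal", 15),
    ("cat", "animal", 10),
    ("dog", "animal", 20),
    ("dog", "pet", 5),
    ("keyboard", "product", 20),
    ("keyboard", "input_device", 2),
]
animal_sense = score_cluster(["cat", "dog", "rat"], patterns)
device_sense = score_cluster(["keyboard", "joystick"], patterns)
print(animal_sense)
print(device_sense)
```

Multiplying by the number of matching terms rewards hypernyms that are supported by several cluster members rather than by a single frequent pattern.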
