Sense Clustering

This documentation covers the current version, starting with JoBimText 0.0.12. For older versions, see the documentation for Chinese Whispers 0.0.8–0.1.2.

Chinese Whispers Clustering

Sense clustering is performed using Chinese Whispers (CW). You need the current JoBimText pipeline and a Distributional Thesaurus (DT) to cluster. To run the clustering, execute the following command from the JoBimText pipeline folder:

java -cp lib/org.jobimtext-*.jar:lib/* org.jobimtext.sense.ComputeSenseClusters -i path/dt-file -o output-file -N 200 -n 100

This command will produce a clustered DT file with the following format:

word <tab> ID <tab> cluster-term1, cluster-term2, ... <tab> min_rank
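
The following is a minimal sketch of how a line in this format could be read in Python. It assumes the four fields are separated by tab characters and the cluster terms by commas, as shown above; the names SenseCluster and parse_cluster_line and the sample line are made up for illustration and are not part of JoBimText. The ID and min_rank fields are kept as strings because their exact types are not specified here.

```python
from typing import List, NamedTuple


class SenseCluster(NamedTuple):
    """One line of the clustered DT file (hypothetical helper type)."""
    word: str
    cluster_id: str
    terms: List[str]
    min_rank: str


def parse_cluster_line(line: str) -> SenseCluster:
    """Split a line of the form word<tab>ID<tab>term1, term2, ...<tab>min_rank."""
    word, cluster_id, terms, min_rank = line.rstrip("\n").split("\t")
    # Cluster terms are comma-separated; surrounding spaces are dropped.
    term_list = [t.strip() for t in terms.split(",") if t.strip()]
    return SenseCluster(word, cluster_id, term_list, min_rank)


# Invented sample line, for illustration only.
cluster = parse_cluster_line("jaguar\t0\tlion, tiger, leopard\t3\n")
print(cluster.word, cluster.terms)
```

Note that splitting on commas would break cluster terms that contain a comma themselves; whether that can occur depends on the DT in use.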

These are the most important parameters:

Parameter   Description
-a          Node weighting parameter: 1 = constant, lin = linear, log = logarithmic
-N          The number of DT entries to process for each term (e.g. 100 or 200)
-n          The number of entries to process for each DT entry term (e.g. 50, 100 or 200)
-i          Input file
-o          Output file
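
To give an idea of what the -a option influences, here is a small, simplified sketch of Chinese Whispers in Python. It is not the JoBimText implementation. In particular, the weighting formulas are an assumption: with "lin" a neighbour's vote is divided by its degree, with "log" by log(1 + degree), and with "1" the edge weight is used unchanged. The function name, the toy graph and the tie-breaking rule are likewise made up for illustration, and the -N and -n limits are ignored.

```python
import math
import random
from collections import defaultdict


def chinese_whispers(edges, weighting="1", iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w

    def vote(neighbor, weight):
        # Assumed node weighting: down-weight votes from high-degree nodes.
        degree = len(graph[neighbor])
        if weighting == "lin":
            return weight / degree
        if weighting == "log":
            return weight / math.log(1 + degree)
        return weight  # "1": constant node weight

    labels = {node: node for node in graph}  # every node starts in its own class
    rng = random.Random(seed)
    nodes = sorted(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)  # nodes are visited in random order
        for node in nodes:
            scores = defaultdict(float)
            for neighbor, weight in graph[node].items():
                scores[labels[neighbor]] += vote(neighbor, weight)
            # Adopt the strongest label; ties go to the smallest label.
            labels[node] = min(scores, key=lambda lab: (-scores[lab], lab))

    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return sorted(sorted(c) for c in clusters.values())


# Toy graph: two separate triangles, all edge weights equal.
toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
             ("x", "y", 1.0), ("y", "z", 1.0), ("x", "z", 1.0)]
print(chinese_whispers(toy_edges, weighting="lin"))
```

On real DT graphs the result depends on the random visiting order, so different runs can produce slightly different clusters.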

Some optional parameters can improve the clustering, e.g. by discarding clusters that are too small or not similar enough to the head word.

Parameter   Description
-ms         Minimal similarity of the cluster element to the head word, e.g. 5 (default 1)
-mc         Minimal cluster size, e.g. 3 (default 0)
-mr         Maximal cluster rank, e.g. 100. The cluster rank is the rank of the cluster element with the highest similarity to the head word in the similarity list. The other cluster elements can have higher ranks.
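
The effect of these three filters can be sketched in Python as follows. This is an illustration under assumptions, not the JoBimText code: it assumes that -ms removes individual cluster elements below the threshold, that -mc is checked after that removal, and that ranks are counted from 1 in the head word's similarity list. The function filter_clusters and the sample data are hypothetical.

```python
def filter_clusters(clusters, similarity, min_sim=1, min_size=0, max_rank=None):
    """Filter the clusters of one head word.

    clusters:   list of lists of terms
    similarity: dict mapping each term to its similarity to the head word
    """
    # Rank 1 is the term most similar to the head word.
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    rank = {term: i + 1 for i, term in enumerate(ranked)}

    kept = []
    for cluster in clusters:
        # Assumed -ms behaviour: drop elements that are not similar enough.
        members = [t for t in cluster if similarity[t] >= min_sim]
        # Assumed -mc behaviour: drop clusters that are empty or too small.
        if not members or len(members) < min_size:
            continue
        # -mr: the cluster rank is the best (lowest) rank among its elements.
        cluster_rank = min(rank[t] for t in members)
        if max_rank is not None and cluster_rank > max_rank:
            continue
        kept.append(members)
    return kept


# Invented similarity list and clusters for one head word.
sims = {"lion": 40, "tiger": 35, "car": 12, "rover": 3}
groups = [["lion", "tiger"], ["car", "rover"]]
print(filter_clusters(groups, sims, min_sim=5, min_size=2))
```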

If you want to label the sense clusters with ISAs, follow the documentation of Sense Labelling.

DT Pruning

Note that some DT graphs are too large to fit fully into memory. In that case, you can remove infrequent or not very similar entries with the pruning Python script Prune.py.

The script keeps only DT entries that meet the minimal word count given as the third parameter. The input files (word count and DT file) can be either gzipped or plain text.

python Prune.py WordCount.gz DT__SimSortlimit_200.gz 100 | grep -vP "\t[0-9](\.[0-9])?$" > dt_pruned.txt

This keeps only entries with a word count higher than 100 and then prunes the remaining DT entries by similarity: the grep filter removes every line whose last column is a single-digit score (0 to 9), optionally followed by one decimal place.
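
If grep -P is not available, the similarity filter can be reproduced in Python. The sketch below uses the same regular expression as the command above and only assumes that the similarity score is the last tab-separated column; the function name keep_line and the sample lines are made up for illustration.

```python
import re

# Same pattern as in the grep call: a tab, one digit, optionally one decimal
# place, then the end of the line.
LOW_SIMILARITY = re.compile(r"\t[0-9](\.[0-9])?$")


def keep_line(line: str) -> bool:
    """Return True if the DT line survives the similarity pruning."""
    return LOW_SIMILARITY.search(line.rstrip("\n")) is None


# Invented DT lines, for illustration only.
sample = ["cat\tdog\t42", "cat\tsofa\t7", "cat\ttree\t9.5", "cat\tlion\t10.0"]
for entry in sample:
    print(entry.replace("\t", " "), "->", "keep" if keep_line(entry) else "prune")
```

Because the pattern allows at most one decimal place, a score written with more decimals, such as 9.55, does not match and is therefore kept; adjust the expression if the DT stores scores with higher precision.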
