This documentation applies to the current version, starting from JoBimText 0.0.12. For older versions, consult the Chinese Whispers 0.0.8–0.1.2 documentation.
Chinese Whispers clustering
Sense clustering is performed using Chinese Whispers (CW). You will need the current JoBimText pipeline and a Distributional Thesaurus (DT) as clustering input. To run the clustering, execute the following command from the JoBimText pipeline folder:
java -cp lib/org.jobimtext-*.jar:lib/* org.jobimtext.sense.ComputeSenseClusters -i path/dt-file -o output-file -N 200 -n 100
This command will produce a clustered DT file with the following format:
word <tab> ID <tab> cluster-term1, cluster-term2, ... <tab> min_rank
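The output format above can be read with a few lines of code. The following sketch parses one line into its four fields; the example line is hypothetical and only illustrates the tab-separated layout described above.

```python
def parse_cluster_line(line):
    """Split a clustered DT line into (word, cluster ID, cluster terms, min_rank)."""
    word, cluster_id, terms, min_rank = line.rstrip("\n").split("\t")
    # Cluster terms are comma-separated within the third field.
    return word, int(cluster_id), [t.strip() for t in terms.split(",")], int(min_rank)

word, cid, terms, min_rank = parse_cluster_line("jaguar\t0\tcar, vehicle, sedan\t1")
```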
These are the most important parameters:
|-a||Node weighting parameter: 1 = constant, lin = linear, log = logarithmic|
|-N||The number of DT entries to process for each term (e.g. 100 or 200)|
|-n||The number of entries to process for each DT entry term (e.g. 50, 100 or 200)|
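To illustrate what the -a node weighting controls, here is a minimal Chinese Whispers sketch, not the JoBimText implementation. It assumes the common CW variant in which a neighbour's vote is its edge weight, optionally downweighted by the neighbour's degree (linearly or logarithmically); function and variable names are our own.

```python
import math
import random

def chinese_whispers(edges, weighting="1", iterations=20, seed=0):
    """Minimal CW sketch. `edges` maps (u, v) pairs to edge weights.

    weighting: '1' = constant, 'lin' = divide by neighbour degree,
    'log' = divide by log(degree + 1) (assumed interpretation of -a).
    """
    rng = random.Random(seed)
    graph = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    label = {n: n for n in graph}            # every node starts in its own class
    for _ in range(iterations):
        nodes = list(graph)
        rng.shuffle(nodes)                   # visit nodes in random order
        for n in nodes:
            votes = {}
            for nb, w in graph[n].items():
                deg = len(graph[nb])
                if weighting == "lin":
                    w = w / deg
                elif weighting == "log":
                    w = w / math.log(deg + 1)
                votes[label[nb]] = votes.get(label[nb], 0.0) + w
            if votes:
                label[n] = max(votes, key=votes.get)  # adopt the strongest class
    return label
```

Running this on a graph with two disconnected triangles yields one class per triangle, since labels can only spread along edges.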
There are also optional parameters that can be used to produce better sense clusterings, e.g. by discarding clusters that are too small or not sufficiently similar:
|-ms||Minimal similarity of the cluster element to the head word, e.g. 5 (default 1)|
|-mc||Minimal cluster size, e.g. 3 (default 0)|
|-mr||Maximal cluster rank, e.g. 100. The cluster rank is the rank, in the similarity list, of the cluster element with the highest similarity to the head word; the other cluster elements may have higher ranks.|
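The effect of these three filters can be sketched in a few lines. This is not the JoBimText code; the cluster representation (a list of (term, similarity, rank) tuples) is a hypothetical stand-in chosen to mirror the descriptions of -ms, -mc and -mr above.

```python
def filter_clusters(clusters, min_sim=1.0, min_size=0, max_rank=None):
    """Sketch of the -ms / -mc / -mr filtering described above.

    Each cluster is a dict with a 'terms' list of (term, similarity, rank).
    """
    kept = []
    for cluster in clusters:
        # -ms: drop elements below the minimal similarity to the head word
        terms = [(t, s, r) for (t, s, r) in cluster["terms"] if s >= min_sim]
        # -mc: drop clusters that became too small
        if len(terms) < min_size:
            continue
        # -mr: the cluster rank is the best (lowest) rank among the elements
        if max_rank is not None:
            if not terms or min(r for _, _, r in terms) > max_rank:
                continue
        kept.append({**cluster, "terms": terms})
    return kept
```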
If you want to label the sense clusters with ISAs, follow the documentation of Sense Labelling.
Note that some DT graphs are too large to fit into memory. In this case you can prune infrequent or not very similar entries with the Python script Prune.py.
The script keeps only DT entries whose word count meets the minimum specified as the third parameter. The input files (word count and DT file) can be either gzipped or plain text.
python Prune.py WordCount.gz DT__SimSortlimit_200.gz 100 | grep -vP "\t[0-9](\.[0-9])?$" > dt_pruned.txt
This keeps only entries with a word count higher than 100 and afterwards prunes DT entries by similarity: the grep filter removes all lines ending in a single-digit similarity (optionally with one decimal place), i.e. similarity values below 10.
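The combined effect of Prune.py and the grep filter can be approximated in memory as follows. This is a sketch, not the actual Prune.py: the line format `word<TAB>similar_word<TAB>similarity` and the function name are assumptions, and the similarity cut-off simplifies the regex to "below 10".

```python
def prune_dt(dt_lines, word_counts, min_count=100, min_sim=10.0):
    """Keep DT entries whose head word occurs more than `min_count` times
    and whose similarity is at least `min_sim` (mirrors the grep filter)."""
    kept = []
    for line in dt_lines:
        word, sim_word, sim = line.split("\t")
        if word_counts.get(word, 0) > min_count and float(sim) >= min_sim:
            kept.append(line)
    return kept
```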