Sense Clustering v 0.0.8–0.0.12

Sense clustering is performed using Chinese Whispers (CW) and some preprocessing scripts. The clustering takes three steps:

DT Pruning

As Chinese Whispers needs to store the complete graph in memory, it is advisable to prune the Distributional Thesaurus (DT) when working with large corpora. You can use the Python pruning script Prune.py.

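The script checks that the DT entries have a minimal word count, which is specified as the third parameter. The input files (word count and DT file) can be either gzipped or plain text files. In the command below, the grep filter additionally drops lines that end in a single-digit value (optionally with one decimal place); <tab> stands for a literal tab character, and -E is needed so that the group and the question mark are read as regular expression operators.

python Prune.py WordCount.gz DT__SimSortlimit_200.gz 100 | grep -vE "<tab>[0-9](\.[0-9])?$" > dt_pruned.txt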
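
The grep filter above can also be written in Python, which avoids having to type a literal tab in the shell. This is a minimal sketch, assuming the pruned DT is tab-separated with the value to be tested in the last column; the name drop_low_scores is made up for illustration and is not part of Prune.py.

```python
import re

# Matches a line whose last tab-separated column is a single digit,
# optionally followed by a dot and one more digit (e.g. "7" or "7.5").
LOW_SCORE = re.compile(r"\t[0-9](\.[0-9])?$")


def drop_low_scores(lines):
    """Yield only the lines that the grep -v filter would keep."""
    for line in lines:
        if not LOW_SCORE.search(line.rstrip("\n")):
            yield line


# Toy input (hypothetical terms and values).
sample = ["cat\tdog\t7\n", "cat\tmouse\t7.5\n", "cat\tlion\t12\n"]
kept = list(drop_low_scores(sample))
```

With the toy input, only the line ending in 12 remains in kept.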

Graph conversion

To convert the pruned DT into a CW-compatible format, you can use the Perl script sim2cwformat.pl. It takes a DT file as input and creates a nodes file and an edges file named after it (for dt_pruned.txt: dt_pruned.txt.nodes and dt_pruned.txt.edges).

perl sim2cwformat.pl dt_pruned.txt

Please make sure that the edges file is sorted by the first column (numeric) ascending and then by the third column (numeric) descending.
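
If the edges file is not in that order, it can be re-sorted. The following is a minimal sketch, assuming the file is tab-separated, has numeric values in the first and third columns, and fits in memory; sort_edges and sort_edges_file are hypothetical helpers, not part of sim2cwformat.pl.

```python
def sort_edges(rows):
    """Sort edge rows by column 1 ascending, then column 3 descending.

    rows is a list of tab-separated strings without trailing newlines.
    """
    def key(row):
        cols = row.split("\t")
        # Negate the third column so that larger values come first.
        return (float(cols[0]), -float(cols[2]))

    return sorted(rows, key=key)


def sort_edges_file(src, dst):
    """Read src, sort its non-empty lines, and write the result to dst."""
    with open(src, encoding="utf-8") as f:
        rows = [line.rstrip("\n") for line in f if line.strip()]
    with open(dst, "w", encoding="utf-8") as f:
        for row in sort_edges(rows):
            f.write(row + "\n")
```

For example, sort_edges_file("dt_pruned.txt.edges", "dt_pruned.txt.edges.sorted") would write a sorted copy and leave the original file untouched.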

Chinese Whispers clustering

Now the sense clustering is ready to start. ChiWhiDisamb.jar requires the following parameters:

Parameter  Description
-a         Node weighting parameter: 1 = constant, lin = linear, log = logarithmic
-N         The number of DT entries to process for each term (e.g. 100 or 200)
-n         The number of entries to process for each DT entry term (e.g. 50, 100 or 200)
-o         Output file
-F         Use input files specified by -i
-i         Input files: nodes edges

There are also some optional parameters that can be used to produce better clusterings, e.g. by discarding clusters that are too small or not similar enough to the head word.

Parameter  Description
-ms        Minimal similarity of the cluster element to the head word, e.g. 5 (default 1)
-mc        Minimal cluster size, e.g. 3 (default 0)
-mr        Maximal cluster rank, e.g. 100. The cluster rank is the rank, in the
           similarity list, of the cluster element with the highest similarity
           to the head word. The other cluster elements can have higher ranks.
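
To make the cluster rank definition concrete, here is a small illustration. It is not taken from ChiWhiDisamb.jar; it only restates the descriptions of -mc and -mr above, assuming that ranks start at 1 for the most similar entry and that a cluster is kept when its size is at least -mc and its cluster rank is at most -mr. All names and the toy data are hypothetical.

```python
def cluster_rank(cluster, similarity_list):
    """Return the 1-based rank of the cluster element that comes first in
    the head word's similarity list (the list is ordered most similar first)."""
    rank_of = {term: i + 1 for i, term in enumerate(similarity_list)}
    return min(rank_of[term] for term in cluster)


def keep_cluster(cluster, similarity_list, min_size=0, max_rank=None):
    """Apply -mc style (min_size) and -mr style (max_rank) checks to a cluster."""
    if len(cluster) < min_size:
        return False
    if max_rank is not None and cluster_rank(cluster, similarity_list) > max_rank:
        return False
    return True


# Toy similarity list for the head word "jaguar", most similar term first.
sims = ["leopard", "tiger", "bmw", "panther", "audi", "porsche"]
animals = {"leopard", "tiger", "panther"}
cars = {"bmw", "audi", "porsche"}
```

In this toy data the animals cluster has cluster rank 1 and the cars cluster has cluster rank 3, so a maximal cluster rank of 2 would keep only the animals cluster.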

The following is a setting for a good, rather coarse-grained clustering:

java -jar ChiWhiDisamb.jar -a 1 -N 200 -n 200 -ms 5 -mc 3 -mr 100 -o DT_sense_cluster -F -i dt_pruned.txt.nodes dt_pruned.txt.edges
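
When the clustering is started from a script, the command line can be assembled programmatically. This is a minimal sketch that uses only the flags listed above; build_cw_command is a hypothetical helper, and its default values simply mirror the example command.

```python
import subprocess


def build_cw_command(nodes, edges, output, weighting="1", N=200, n=200,
                     ms=5, mc=3, mr=100, jar="ChiWhiDisamb.jar"):
    """Assemble the argument list for the ChiWhiDisamb.jar call shown above."""
    return [
        "java", "-jar", jar,
        "-a", str(weighting),
        "-N", str(N), "-n", str(n),
        "-ms", str(ms), "-mc", str(mc), "-mr", str(mr),
        "-o", output,
        "-F", "-i", nodes, edges,
    ]


cmd = build_cw_command("dt_pruned.txt.nodes", "dt_pruned.txt.edges",
                       "DT_sense_cluster")
print(" ".join(cmd))

# To actually run it (requires Java and the jar in the working directory):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.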

Note that the complete data set size (nodes + edges) should not exceed 3 GB. If it does, repeat the pruning step with a higher minimal word count.
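
The combined size can be checked before the clustering is started. This is a minimal sketch; it reads the limit as 3 GiB, which is an assumption, and the helper names are hypothetical.

```python
import os

# 3 GB, interpreted here as 3 * 1024^3 bytes (an assumption).
LIMIT_BYTES = 3 * 1024 ** 3


def total_size(paths):
    """Return the combined size of the given files in bytes."""
    return sum(os.path.getsize(p) for p in paths)


def within_limit(paths, limit=LIMIT_BYTES):
    """Return True if the files together do not exceed the limit."""
    return total_size(paths) <= limit
```

For example, within_limit(["dt_pruned.txt.nodes", "dt_pruned.txt.edges"]) returning False would suggest pruning again with a higher minimal word count.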

If you want to label the sense clusters with ISAs, follow the documentation of Sense Labelling.
