Sense Clustering is performed using Chinese Whispers (CW) and some preprocessing scripts. Clustering consists of three steps:
DT Pruning
As Chinese Whispers needs to store the complete graph in memory, it is advisable to prune the Distributional Thesaurus (DT) when working with large corpora. You can use the Python pruning script Prune.py.
The script keeps only those DT entries whose word count reaches the minimum specified as the third parameter. The input files (word count and DT file) can be either gzipped or plain text files.
python Prune.py WordCount.gz DT__SimSortlimit_200.gz 100 | grep -vE "<tab>[0-9](\.[0-9])?$" > dt_pruned.txt
Here, `<tab>` stands for a literal tab character; the grep call additionally drops lines whose last column is a single-digit value (optionally with one decimal place), i.e. entries with a very low similarity value.
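As a quick sanity check, you can count how many DT entries survived pruning and inspect a few of them, for example:
wc -l dt_pruned.txt
head -n 5 dt_pruned.txt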
Graph conversion
To convert the pruned DT into a CW-compatible format, you can use the Perl script sim2cwformat.pl. It takes a DT file as input and creates corresponding .nodes and .edges files (here, dt_pruned.txt.nodes and dt_pruned.txt.edges).
perl sim2cwformat.pl dt_pruned.txt
Please make sure that the `.edges` file is sorted by the first column (numeric, ascending) and then by the third column (numeric, descending).
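If the file is not already in this order, one way to sort it (assuming a tab-separated edges file and a bash-like shell for the $'\t' syntax) is with GNU sort:
sort -t$'\t' -k1,1n -k3,3nr -o dt_pruned.txt.edges dt_pruned.txt.edges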
Chinese Whispers clustering
Now the sense clustering can be started. ChiWhiDisamb.jar requires the following parameters:
Parameter | Description |
---|---|
-a | Node weighting parameter: 1 = constant, lin = linear, log = logarithmic |
-N | The number of DT entries to process for each term (e.g. 100 or 200) |
-n | The number of entries to process for each DT entry term (e.g. 50, 100 or 200) |
-o | Output file |
-F | Use input files specified by -i |
-i | Input files: the nodes file and the edges file (in that order) |
There are also some optional parameters that can be used to produce better clusterings, e.g. by discarding clusters that are too small or not sufficiently similar to the head word.
Parameter | Description |
---|---|
-ms | Minimal similarity of the cluster element to the head word, e.g. 5 (default 1) |
-mc | Minimal cluster size, e.g. 3 (default 0) |
-mr | Maximal cluster rank, e.g. 100. The cluster rank is the rank, in the similarity list, of the cluster element with the highest similarity to the head word; the other cluster elements may have higher ranks. |
Putting this together, the following is a setting that yields a good, rather coarse-grained clustering:
java -jar ChiWhiDisamb.jar -a 1 -N 200 -n 200 -ms 5 -mc 3 -mr 100 -o DT_sense_cluster -F -i dt_pruned.txt.nodes dt_pruned.txt.edges
Note that the complete data set size (nodes + edges) should not exceed 3GB. If it does, you can perform the pruning step with a higher minimal word count.
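To check the combined size of the two input files, you can use, for example:
du -ch dt_pruned.txt.nodes dt_pruned.txt.edges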
If you want to label the sense clusters with ISAs, see the Sense Labelling documentation.