For some scenarios, it might be advantageous to reduce the size of the DT, e.g. to fit it into memory or to remove noise. This page describes our best practices for pruning DTs, which remove information about a) low-frequency terms and b) low similarities, which are most probably random noise.
DT Pruning
The scripts mentioned on this page are available at the Subversion page.
Word and feature count
To prune your DT you first have to remove words and features with low counts. The Python script remove_below.py takes the following arguments:
- file to be processed
- column which should be checked
- threshold for filtering
In the following we check the second column (index 1, since indices start from 0) and prune all entries which have counts of 10 or lower:
python remove_below.py twitter2012__FeatureCount 1 10 > twitter2012__FeatureCount_pruned_10
python remove_below.py twitter2012__WordCount 1 10 > twitter2012__WordCount_pruned_10
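The source of remove_below.py is not reproduced on this page, so the following is only a minimal sketch of the filtering step it is described as performing. It assumes tab-separated input, assumes that malformed or non-numeric lines are skipped, and follows the wording above in dropping values equal to the threshold as well; the function name and the sample data are hypothetical.

```python
def remove_below(lines, column, threshold):
    """Yield only the lines whose numeric value in `column` exceeds `threshold`.

    Assumptions (not confirmed by the original script): fields are
    tab-separated, and lines that are too short or non-numeric are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # line has no such column
        try:
            value = float(fields[column])
        except ValueError:
            continue  # value is not a number
        if value > threshold:  # counts of `threshold` or lower are pruned
            yield line


# Hypothetical word-count lines: term <TAB> count
sample = ["the\t1500\n", "rare\t10\n", "typo\t2\n", "word\t11\n"]
kept = list(remove_below(sample, column=1, threshold=10))
```

With this sample, only the entries for "the" and "word" remain. If the real script keeps values equal to the threshold, the comparison would be `>=` instead.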
From these pruned word and feature counts we can create a wordlist and a featurelist for further pruning:
cut -f1 twitter2012__WordCount_pruned_10 > twitter2012__wordlist
cut -f1 twitter2012__FeatureCount_pruned_10 > twitter2012__featurelist
Distributional Thesaurus
In the next step we can prune DT entries with a low similarity score (e.g. 3 or lower). The scores are in the third column (index 2):
python remove_below.py twitter2012__LMI_1000_l200 2 3 > twitter2012__LMI_1000_l200_pruned_3
This should filter out most of the low-quality entries. Additionally, we can use the word list to remove superfluous DT entries whose words have counts that are too low (and are therefore missing from the list). For this we use the remove_not_contained.py script.
It needs the word list as the first, and the (possibly pruned) DT as the second argument:
python remove_not_contained.py twitter2012__wordlist twitter2012__LMI_1000_l200_pruned_3 > twitter2012__LMI_1000_l200_pruned_3_w
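As an illustration of this word-list filter, here is a minimal sketch. It is not the actual remove_not_contained.py: it assumes a tab-separated DT with the two terms in the first two columns and the score in the third, and it assumes that an entry is kept only if both terms appear in the word list. The function name and sample entries are hypothetical.

```python
def remove_not_contained(wordlist_lines, dt_lines):
    """Yield DT lines whose first two fields both occur in the word list.

    Assumption: checking both terms is a guess at the script's behaviour;
    the original may check only one of the two columns.
    """
    allowed = {w.rstrip("\n") for w in wordlist_lines}
    for line in dt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] in allowed and fields[1] in allowed:
            yield line


# Hypothetical word list and DT entries: term <TAB> similar term <TAB> score
words = ["cat\n", "dog\n"]
dt = ["cat\tdog\t12\n", "cat\txqz\t4\n", "xqz\tdog\t5\n"]
filtered = list(remove_not_contained(words, dt))
```

Loading the word list into a set keeps each lookup at constant time, which matters when the DT has many millions of lines.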
Significance scores
We can also prune the significance scores between terms and features. In this operation we filter the file with significance scores (twitter2012__LMI_1000) by removing words that are not contained in our pruned word list.
The same operation is performed with the feature list. Now the significance scores only contain valid entries, for which we have the counts and DT entries.
python remove_not_contained_byColumn.py twitter2012__wordlist twitter2012__LMI_1000 0 > twitter2012__LMI_1000_pruned_w
python remove_not_contained_byColumn.py twitter2012__featurelist twitter2012__LMI_1000_pruned_w 1 > twitter2012__LMI_1000_pruned_w_f
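The two passes above can be pictured with the following sketch. It is an assumption about what remove_not_contained_byColumn.py does, not its actual code: the input is taken to be tab-separated with the word in column 0 and the feature in column 1, and the function name, feature strings, and scores are made up for illustration.

```python
def remove_not_contained_by_column(allowed_lines, lines, column):
    """Yield lines whose field at `column` occurs in the allowed list.

    Assumption: one allowed item per line, tab-separated data lines.
    """
    allowed = {item.rstrip("\n") for item in allowed_lines}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column] in allowed:
            yield line


# Hypothetical significance scores: word <TAB> feature <TAB> score
lmi = [
    "cat\tamod(furry,@)\t31.2\n",
    "xqz\tamod(furry,@)\t8.0\n",
    "cat\tnn(@,zzz)\t5.5\n",
]
words = ["cat\n"]
features = ["amod(furry,@)\n"]

# First pass filters on the word column, second pass on the feature column.
by_word = list(remove_not_contained_by_column(words, lmi, 0))
by_word_and_feature = list(remove_not_contained_by_column(features, by_word, 1))
```

Only the first sample line survives both passes, mirroring the _pruned_w and _pruned_w_f files produced by the commands.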
After the pruning is done, we can remove the word list and the feature list.
rm twitter2012__wordlist
rm twitter2012__featurelist
We recommend the following settings for pruning of large DTs:
- WordCount: remove counts below 10
- FeatureCount: remove counts below 10
- LMI_1000_l200 (distributional thesaurus): remove scores below 3 or 5; additionally remove entries not in the pruned word list
- LMI_1000 (significance scores): remove entries not contained in the pruned word and feature lists