Calculate a Distributional Thesaurus (DT)

This page describes the steps to calculate a Distributional Thesaurus from a text corpus.
In this description, we use twitter as the corpus name and perform bigram holing.
The data should already be on HDFS in the twitter_bigram folder.

Perform the Holing Operation

Before calculating the DT, you need to perform the holing operation on your corpus.
For bigram holing, you can use the following shell script:

sh bigram_holing.sh twitter_bigram twitter_bigram__WordFeatureCount 2

The three required parameters are:

  • project folder
  • output folder for WordFeatureCounts
  • the field in the input files where sentence data is located. For a simple text file, use ‘0’.
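
For example, if your corpus consists of plain text files with one sentence per line, the call might look like this (the folder names here are only illustrative):

sh bigram_holing.sh my_corpus my_corpus__WordFeatureCount 0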

Generate and Run Hadoop Script

You can generate the Hadoop script with a Python script and a few parameters:

python generateHadoopScript.py [parameters]

The parameters, together with recommended or example values, are:

  • dataset: the dataset name, i.e. the folder where the unprocessed corpus is located. Example value: your_corpus_folder
  • wc: word count, the maximum number of unique words a feature is allowed to have. Recommended value: 1000
  • s: significance threshold, the lower threshold for significance scores. Recommended value: 0
  • t: term threshold, the lower threshold for word-feature counts. Recommended value: 2
  • p: feature count, the maximum number of features a word is allowed to have. Recommended value: 1000
  • significance: the significance measure, one of LMI (Lexicographer's Mutual Information), PMI (Pointwise Mutual Information) or LL (Log-Likelihood). Recommended value: LMI
  • l: simsort count, the maximum number of similar terms kept per term. Recommended value: 200
  • queue (optional): the Hadoop queue name; if none is given, "default" is used. Example value: longrunning

Your command will look like this:

python generateHadoopScript.py project_folder 1000 0 2 1000 LMI 200

Then start your generated script and wait for it to finish:

sh dataset_s0_t2_p1000_LMI_simsort200.sh

During processing, it will create the following output folders. Each folder contains several part-*.gz files in TSV format; the column layout is shown underneath each folder name.

  • dataset__PruneFeaturesPerWord_1000__FeatureCount
    feature count
  • dataset__PruneFeaturesPerWord_1000__WordCount
    word count
  • dataset__PruneFeaturesPerWord_1000__WordFeatureCount
    word feature count
  • dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0
    word feature significance_score count
  • dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000
    word feature significance_score count (pruned to contain only features that occur with at most 1000 words)
  • dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt
    feature word+ (list of words that share the feature)
  • dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures
    word1 word2 count
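
Once the script has finished, you can check that all folders were created, e.g. by listing them on HDFS (the prefix below assumes the parameter values used in the example above):

hadoop dfs -ls dataset__PruneFeaturesPerWord_1000__*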

Copy Files to your Filesystem

To copy the results to your local filesystem, use the HDFS -text command.
The -text command decompresses the files on the fly, so you can redirect the content into a new local file, e.g.:

hadoop dfs -text dataset__PruneFeaturesPerWord_1000__WordCount/p* | cat > dataset__WordCount
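
If you want to copy several result folders at once, a small shell loop along these lines can help; this is only a sketch, so adjust the folder names to your dataset and parameter settings:

for folder in dataset__PruneFeaturesPerWord_1000__WordCount \
              dataset__PruneFeaturesPerWord_1000__FeatureCount \
              dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures; do
    # decompress the part files on HDFS and write them into one local file per folder
    hadoop dfs -text ${folder}/p* > ${folder}
done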

Database Access

If you want to access the DT using a database, you have to create the database and load the files into tables.

Create Database and Tables

You can create a MySQL database and the required tables using createTables.py. You need to provide the following arguments:

  • db_name: the database name; it should be a name that is not yet present on the DB server. Example value: twitter_bigram
  • p: the number of features per word, as used in the DT calculation. Example value: 1000
  • sig_measure: the significance measure used in the DT calculation (LMI, PMI or LL). Example value: LMI
  • simsort_limit: the maximum number of similar terms for one term, also taken from the DT calculation. Example value: 200

You can save the output into a file that you can use on your DB server:

python createTables.py twitter_bigram 1000 LMI 200 > create_tables.sql
mysql -u root -p < create_tables.sql

This script generates the following tables:

  • LMI_1000 (fields: word, feature, sig, count): stores counts and significance scores for each word-feature combination
  • LMI_1000_l200 (fields: word1, word2, count): stores the most similar word2 entries for each word1, together with their counts
  • word_count (fields: word, count): stores counts for each word
  • feature_count (fields: feature, count): stores counts for each feature
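
The exact SQL emitted by createTables.py may differ in its details; as a rough sketch, the generated statements follow this pattern (the column types and indexes shown here are assumptions for illustration only):

CREATE DATABASE twitter_bigram;
USE twitter_bigram;

-- word-feature counts and significance scores
CREATE TABLE LMI_1000 (word VARCHAR(200), feature VARCHAR(200), sig FLOAT, count INT, INDEX(word));

-- most similar word2 entries per word1
CREATE TABLE LMI_1000_l200 (word1 VARCHAR(200), word2 VARCHAR(200), count INT, INDEX(word1));

-- word and feature counts
CREATE TABLE word_count (word VARCHAR(200), count INT, INDEX(word));
CREATE TABLE feature_count (feature VARCHAR(200), count INT, INDEX(feature));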

Import Results into a MySQL Database

To import the files into the database, you can use the provided script, import_data.sql.
You will need to change the paths to your DT files.

mysql -u root -p < import_data.sql
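
The import_data.sql script ships with the source code; conceptually it contains one LOAD DATA statement per result file, roughly along these lines (the file paths are placeholders that you need to replace with the paths to your DT files):

USE twitter_bigram;

LOAD DATA LOCAL INFILE '/path/to/dt/dataset__FreqSig' INTO TABLE LMI_1000 FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE '/path/to/dt/dataset__SimCounts' INTO TABLE LMI_1000_l200 FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE '/path/to/dt/dataset__WordCount' INTO TABLE word_count FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INFILE '/path/to/dt/dataset__FeatureCount' INTO TABLE feature_count FIELDS TERMINATED BY '\t';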

Now you can access your DT from a database by using the DatabaseThesaurusDatastructure class.
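
If you want to inspect the DT directly in MySQL instead, the most similar terms for a word can be retrieved with an ordinary query against the tables described above (the word 'mobile' is just a placeholder):

SELECT word2, count FROM LMI_1000_l200 WHERE word1 = 'mobile' ORDER BY count DESC LIMIT 10;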

If you encounter problems with importing or using your DT, you can try pruning it; step-by-step instructions are available on the DT Pruning page.
