This page describes the steps to calculate a Distributional Thesaurus from a text corpus.
In this description, we use twitter as the corpus name and perform bigram holing.
The data should already be on HDFS in the twitter_bigram folder.
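If your corpus is not on HDFS yet, you can upload it with the standard HDFS copy commands (the local file name twitter_corpus.txt is only an example):
hadoop fs -mkdir twitter_bigram
hadoop fs -put twitter_corpus.txt twitter_bigram/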
Perform Holing operation
Before calculating the DT, you need to perform the holing operation on your corpus.
For bigram-holing you can use the following shell script:
sh bigram_holing.sh twitter_bigram twitter_bigram__WordFeatureCount 2
The three required parameters are:
- project folder
- output folder for WordFeatureCounts
- the field in the input files where sentence data is located. For a simple text file, use ‘0’.
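To illustrate what bigram holing produces conceptually, here is a small standalone Python sketch. It is not the actual Hadoop job: the "@" hole symbol and the left/right-neighbour features follow the usual JoBimText convention, so the real bigram_holing.sh output may differ in details.

```python
# Conceptual sketch of bigram holing: each token is paired with its left and
# right neighbour, and the position of the token itself is replaced by a hole
# symbol ("@"). Tokenization and formatting in the real job may differ.

HOLE = "@"

def bigram_holing(sentence):
    """Yield (word, feature) pairs for one sentence."""
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        if i > 0:                       # feature built from the left neighbour
            yield word, tokens[i - 1] + "_" + HOLE
        if i < len(tokens) - 1:         # feature built from the right neighbour
            yield word, HOLE + "_" + tokens[i + 1]

if __name__ == "__main__":
    for word, feature in bigram_holing("i love new york"):
        print(word + "\t" + feature)
    # e.g. the token "love" gets the features "i_@" and "@_new"
```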
Generate and Run Hadoop Script
You can generate the Hadoop script with a Python script and a few parameters: python generateHadoopScript.py [parameters]
Parameter | Description | Recommended / example values |
---|---|---|
dataset | Dataset name: folder where the unprocessed corpus is located | your_corpus_folder |
wc | Word count: maximum number of unique words a feature is allowed to have | 1000 |
s | Significance threshold: lower threshold for significance scores | 0 |
t | Term threshold: lower threshold for word-feature counts | 2 |
p | Feature count: maximal number of features a word is allowed to have | 1000 |
significance | Significance measure: LMI (Lexicographer's Mutual Information), PMI (Pointwise Mutual Information) or LL (Log-Likelihood) | LMI |
l | Simsort count: maximum number of similar terms for a term | 200 |
queue | (optional) Hadoop queue name; if no input is provided, “default” is used | longrunning |
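For reference, the PMI and LMI significance scores are derived from the word, feature, and word-feature counts. The following minimal Python sketch shows the standard formulas; the actual Hadoop job may differ in details such as normalization or smoothing.

```python
from math import log2

def pmi(count_wf, count_w, count_f, n):
    """Pointwise Mutual Information of a word-feature pair.
    n is the total number of word-feature observations."""
    return log2((count_wf * n) / (count_w * count_f))

def lmi(count_wf, count_w, count_f, n):
    """Lexicographer's Mutual Information: PMI weighted by the joint count."""
    return count_wf * pmi(count_wf, count_w, count_f, n)
```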
Your command will look like this:
python generateHadoopScript.py project_folder 1000 0 2 1000 LMI 200
Then start your generated script and wait for it to finish: sh dataset_s0_t2_p1000_LMI_simsort200.sh
During processing, the script creates the following output folders. Each folder contains several part-*.gz
files in TSV format (the column layout is shown underneath each folder name).
- dataset__PruneFeaturesPerWord_1000__FeatureCount
feature count
- dataset__PruneFeaturesPerWord_1000__WordCount
word count
- dataset__PruneFeaturesPerWord_1000__WordFeatureCount
word feature count
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0
word feature significance_score count
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000
word feature significance_score count
(pruned to contain only features that occur with at most 1000 words)
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt
feature word+
(list of words that share the feature)
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures
word1 word2 count
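Because all part files are gzipped TSV, they are easy to inspect directly. Once you have copied a folder to your local filesystem (see the next section), a minimal Python sketch for reading the similarity counts could look like this (the folder path is only an example; adjust it to your dataset name):

```python
import glob
import gzip

# Example path; adjust the dataset name and location to your setup.
folder = "dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures"

# Each part file holds tab-separated lines: word1 <TAB> word2 <TAB> count
# (any additional columns are ignored here).
for part in glob.glob(folder + "/part-*.gz"):
    with gzip.open(part, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            word1, word2, count = fields[0], fields[1], fields[2]
            print(word1, word2, count)
```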
Copy Files to your Filesystem
To copy the results to your local filesystem, use the hadoop dfs -text command.
The -text command decompresses the gzipped part files on the fly; you can then pipe the content into a new local file, e.g.:
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__WordCount/p* | cat > dataset__WordCount
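The remaining result folders can be copied with the same pattern; the local output file names below are just examples:
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures/p* | cat > dataset__SimCounts
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000/p* | cat > dataset__FreqSigLMI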
Database Access
If you want to access the DT using a database, you have to create the database and load the files into tables.
Create Database and Tables
You can create a MySQL database and tables using createTables.py. You need to provide the following arguments:
Argument | Description | example value |
---|---|---|
db_name | The database name. This should be a name that is not yet present on the DB server. | twitter_bigram |
p | The number of features per word (used in the DT calculation). | 1000 |
sig_measure | The significance measure used in the DT calculation, LMI, PMI or LL | LMI |
simsort_limit | The maximum number of similar terms for one term. This is also a value that was used in DT calculation. | 200 |
You can save the output into a file that you can use on your DB server:
python createTables.py twitter_bigram 1000 LMI 200 > create_tables.sql
mysql -u root -p < create_tables.sql
This script generates the following tables:
Table | Fields | Description |
---|---|---|
LMI_1000 | word, feature, sig, count | Stores counts and significance scores for a word–feature combination |
LMI_1000_l200 | word1, word2, count | Stores the most similar word2 items for a word1 and their corresponding count |
word_count | word, count | Stores counts for each word |
feature_count | feature, count | Stores counts for each feature |
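The generated SQL will roughly correspond to the sketch below, which is derived from the table overview above. Column types, indexes, and character sets are assumptions; use the statements actually produced by createTables.py for the real setup.

```sql
-- Rough sketch of the schema implied by the table overview above.
CREATE DATABASE twitter_bigram;
USE twitter_bigram;

CREATE TABLE LMI_1000 (
  word    VARCHAR(100),
  feature VARCHAR(200),
  sig     DOUBLE,
  count   INT,
  INDEX (word)
);

CREATE TABLE LMI_1000_l200 (
  word1 VARCHAR(100),
  word2 VARCHAR(100),
  count DOUBLE,          -- numeric similarity value
  INDEX (word1)
);

CREATE TABLE word_count (
  word  VARCHAR(100),
  count INT
);

CREATE TABLE feature_count (
  feature VARCHAR(200),
  count   INT
);
```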
Import Results into a MySQL Database
To import the files into the database, you can use the provided script, import_data.sql.
You will need to change the paths to your DT files.
mysql -u root -p < import_data.sql
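If you need to adapt the paths, the statements in import_data.sql will look roughly like the following sketch, which assumes standard LOAD DATA INFILE statements over tab-separated files; the provided script may differ, and the file paths shown are examples only.

```sql
-- Sketch of a tab-separated import; adjust the paths to where you copied the
-- DT files, and compare with the statements in the provided import_data.sql.
USE twitter_bigram;

LOAD DATA LOCAL INFILE '/path/to/dataset__WordCount'
  INTO TABLE word_count
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';

LOAD DATA LOCAL INFILE '/path/to/dataset__FeatureCount'
  INTO TABLE feature_count
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
```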
Now you can access your DT from a database by using the DatabaseThesaurusDatastructure class.
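Independently of that class, you can sanity-check the import with a direct query against the similarity table; the word 'mobile' below is just an example term.

```sql
-- Retrieve the most similar terms for one word, ordered by similarity count.
SELECT word2, count
FROM LMI_1000_l200
WHERE word1 = 'mobile'
ORDER BY count DESC
LIMIT 10;
```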
If you encounter problems with importing or using your DT, you can try Pruning to get it to work.
How do you prune it? It does not fit into memory in R!
Hi Chris,
thank you for asking! The pruning scripts are provided with the source code. I have compiled step-by-step instructions for pruning: DT Pruning
Good luck!
Eugen