This page describes the steps to calculate a Distributional Thesaurus from a text corpus.
In this description, we use twitter as the corpus name and perform bigram holing.
The corpus data should already be on HDFS in the twitter_bigram folder.
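If the corpus is not on HDFS yet, you can upload it first with standard HDFS commands; the local path below is only a placeholder for wherever your corpus files live:
hadoop dfs -mkdir twitter_bigram
hadoop dfs -put /path/to/your/corpus/* twitter_bigram/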
Perform the Holing Operation
Before calculating the DT, you need to perform the holing operation on your corpus.
For bigram-holing you can use the following shell script:
sh bigram_holing.sh twitter_bigram twitter_bigram__WordFeatureCount 2
The three required parameters are:
- the project folder
- the output folder for WordFeatureCounts
- the field in the input files where the sentence data is located; for a simple text file, use '0' (see the example below)
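For instance, if your corpus consisted of plain text files with one sentence per line, the call might look like this (the folder names are only placeholders):
sh bigram_holing.sh twitter_plain twitter_plain__WordFeatureCount 0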
Generate and Run Hadoop Script
You can generate the Hadoop script with a Python script and a few parameters:
python generateHadoopScript.py [parameters]
The parameters are, in order:
|folder||folder where the unprocessed corpus is located||project_folder|
|wc||Word count: maximum number of unique words a feature is allowed to have||1000|
|s||lower threshold for significance scores||0|
|t||lower threshold for word-feature counts||2|
|p||maximal number of features a word is allowed to have||1000|
|significance||Significance measure: LMI (Lexicographer's Mutual Information), PMI (Pointwise Mutual Information) or LL (Log-Likelihood)||LMI|
|simsort_limit||maximum number of similar terms for a term||200|
|queue||(optional) Hadoop queue name; if no input is provided, "default" is used||longrunning|
Your command will look like this:
python generateHadoopScript.py project_folder 1000 0 2 1000 LMI 200
Then start your generated script and wait for it to finish:
sh dataset_s0_t2_p1000_LMI_simsort200.sh
During processing, it creates several output folders. Each folder contains a number of part-*.gz files in tsv format; the column layouts of the values are:
- word  feature  count
- word  feature  significance_score  count
- word  feature  significance_score  count (pruned to contain only features that occur with at most 1000 words)
- feature  word+ (list of words that share the feature)
- word1  word2  count
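To sanity-check any of these folders, you can print a few lines directly from HDFS; the folder name here is the same example folder used in the copy command below:
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__WordCount/p* | head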
Copy Files to your Filesystem
To copy the results to your local filesystem, use the HDFS -text command, which unzips the files on the fly. You can then redirect the content into a local file, e.g.:
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__WordCount/p* > dataset__WordCount
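If you need several output folders locally, a small shell loop saves some typing. The folder names below are only illustrative and should be replaced with the folders created by your run:
for folder in dataset__WordCount dataset__PruneFeaturesPerWord_1000__WordCount; do
  hadoop dfs -text $folder/p* > $folder
done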
If you want to access the DT using a database, you have to create the database and load the files into tables.
Create Database and Tables
You can create a MySQL database and the tables using the createTables.py script. You need to provide the following arguments:
|db_name||The database name. This should be a name that is not yet present on the DB server.||twitter_bigram|
|p||The number of features per word (used in the DT calculation).||1000|
|sig_measure||The significance measure used in the DT calculation: LMI, PMI, or LL.||LMI|
|simsort_limit||The maximum number of similar terms for one term. This is also a value that was used in DT calculation.||200|
You can save the output into a file and then run it on your DB server:
python createTables.py twitter_bigram 1000 LMI 200 > create_tables.sql
mysql -u root -p < create_tables.sql
This script generates the following tables:
|LMI_1000||word, feature, sig, count||Stores counts and significance scores for a word-feature combination|
|LMI_1000_l200||word1, word2, count||Stores the most similar word2 items for a word1 and their corresponding count|
|word_count||word, count||Stores counts for each word|
|feature_count||feature, count||Stores counts for each feature|
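Once create_tables.sql has been loaded, you can quickly verify that the database and tables were created:
mysql -u root -p twitter_bigram -e "SHOW TABLES;"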
Import Results into a MySQL Database
To import the files into the database, you can use the provided script, import_data.sql.
Before running it, adjust the paths in the script so that they point to your DT files:
mysql -u root -p < import_data.sql
Now you can access your DT from a database by using the DatabaseThesaurusDatastructure class.
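You can also query the tables directly with plain SQL; the most similar terms for a word can be read from the LMI_1000_l200 table, ordered by their count. The word 'car' below is just an example lookup term:
mysql -u root -p twitter_bigram -e "SELECT word2, count FROM LMI_1000_l200 WHERE word1 = 'car' ORDER BY count DESC LIMIT 10;"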
If you encounter problems with importing or using your DT, you can try Pruning to get it to work.