This page describes the steps to calculate a Distributional Thesaurus from a text corpus.
In this description, we use twitter as the corpus name and perform bigram holing.
The data should already be on HDFS in the twitter_bigram folder.
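If your corpus is not on HDFS yet, you can upload it with the standard HDFS copy commands (the local file name twitter_corpus.txt is only an example):
hadoop fs -mkdir twitter_bigram
hadoop fs -put twitter_corpus.txt twitter_bigram/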
Perform Holing operation
Before calculating the DT, you need to perform the holing operation on your corpus.
For bigram-holing you can use the following shell script:
sh bigram_holing.sh twitter_bigram twitter_bigram__WordFeatureCount 2
The three required parameters are:
- project folder
- output folder for WordFeatureCounts
- the field in the input files where sentence data is located. For a simple text file, use ‘0’.
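To illustrate what bigram holing produces conceptually, here is a small standalone Python sketch. It is not the actual Hadoop job: the "@" hole symbol and the left/right-neighbour features follow the usual JoBimText convention, so the real bigram_holing.sh output may differ in details.

```python
# Conceptual sketch of bigram holing: each token is paired with its left and
# right neighbour, and the position of the token itself is replaced by a hole
# symbol ("@"). Tokenization and formatting in the real job may differ.

HOLE = "@"

def bigram_holing(sentence):
    """Yield (word, feature) pairs for one sentence."""
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        if i > 0:                       # feature built from the left neighbour
            yield word, tokens[i - 1] + "_" + HOLE
        if i < len(tokens) - 1:         # feature built from the right neighbour
            yield word, HOLE + "_" + tokens[i + 1]

if __name__ == "__main__":
    for word, feature in bigram_holing("i love new york"):
        print(word + "\t" + feature)
    # e.g. the token "love" gets the features "i_@" and "@_new"
```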
Generate and Run Hadoop Script
You can generate the Hadoop script with a Python script and a few parameters: python generateHadoopScript.py [parameters]
Parameter | Description | Recommended / example values |
---|---|---|
dataset | Dataset name: folder where the unprocessed corpus is located | your_corpus_folder |
wc | Word count: maximum number of unique words a feature is allowed to have | 1000 |
s | Significance threshold: lower threshold for significance scores | 0 |
t | Term threshold: lower threshold for word-feature counts | 2 |
p | Feature count: maximal number of features a word is allowed to have | 1000 |
significance | Significance measure: LMI (Lexicographer's Mutual Information), PMI (Pointwise Mutual Information) or LL (Log-Likelihood) | LMI |
l | Simsort count: maximum number of similar terms for a term | 200 |
queue | (optional) Hadoop queue name; if no input is provided, “default” is used | longrunning |
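For reference, the PMI and LMI significance scores are derived from the word, feature, and word-feature counts. The following minimal Python sketch shows the standard formulas; the actual Hadoop job may differ in details such as normalization or smoothing.

```python
from math import log2

def pmi(count_wf, count_w, count_f, n):
    """Pointwise Mutual Information of a word-feature pair.
    n is the total number of word-feature observations."""
    return log2((count_wf * n) / (count_w * count_f))

def lmi(count_wf, count_w, count_f, n):
    """Lexicographer's Mutual Information: PMI weighted by the joint count."""
    return count_wf * pmi(count_wf, count_w, count_f, n)
```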
Your command will look like this:
python generateHadoopScript.py project_folder 1000 0 2 1000 LMI 200
Then start your generated script and wait for it to finish: sh dataset_s0_t2_p1000_LMI_simsort200.sh
During processing, the script creates the following output folders. Each folder contains several part-*.gz
files in TSV format (the column layout is shown underneath each folder name).
- dataset__PruneFeaturesPerWord_1000__FeatureCount
feature count
- dataset__PruneFeaturesPerWord_1000__WordCount
word count
- dataset__PruneFeaturesPerWord_1000__WordFeatureCount
word feature count
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0
word feature significance_score count
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000
word feature significance_score count
(pruned to contain only features that occur with at most 1000 words)
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt
feature word+
(list of words that share the feature)
- dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures
word1 word2 count
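Because all part files are gzipped TSV, they are easy to inspect directly. Once you have copied a folder to your local filesystem (see the next section), a minimal Python sketch for reading the similarity counts could look like this (the folder path is only an example; adjust it to your dataset name):

```python
import glob
import gzip

# Example path; adjust the dataset name and location to your setup.
folder = "dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures"

# Each part file holds tab-separated lines: word1 <TAB> word2 <TAB> count
# (any additional columns are ignored here).
for part in glob.glob(folder + "/part-*.gz"):
    with gzip.open(part, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            word1, word2, count = fields[0], fields[1], fields[2]
            print(word1, word2, count)
```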
Copy Files to your Filesystem
To copy the results to your local filesystem, use the hadoop dfs -text command.
The -text command decompresses the gzipped part files on the fly; you can then pipe the content into a new local file, e.g.:
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__WordCount/p* | cat > dataset__WordCount
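The remaining result folders can be copied with the same pattern; the local output file names below are just examples:
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures/p* | cat > dataset__SimCounts
hadoop dfs -text dataset__PruneFeaturesPerWord_1000__FreqSigLMI_s_0_t_0__PruneGraph_p_1000/p* | cat > dataset__FreqSigLMI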
Database Access
If you want to access the DT using a database, you have to create the database and load the files into tables.
Create Database and Tables
You can create a MySQL database and tables using createTables.py. You need to provide the following arguments:
Argument | Description | example value |
---|---|---|
db_name | The database name. This should be a name that is not yet present on the DB server. | twitter_bigram |
p | The number of features per word (used in the DT calculation). | 1000 |
sig_measure | The significance measure used in the DT calculation, LMI, PMI or LL | LMI |
simsort_limit | The maximum number of similar terms for one term. This is also a value that was used in DT calculation. | 200 |
You can save the output into a file that you can use on your DB server:
python createTables.py twitter_bigram 1000 LMI 200 > create_tables.sql
mysql -u root -p < create_tables.sql
This script generates the following tables:
Table | Fields | Description |
---|---|---|
LMI_1000 | word, feature, sig, count | Stores counts and significance scores for a word–feature combination |
LMI_1000_l200 | word1, word2, count | Stores the most similar word2 items for a word1 and their corresponding count |
word_count | word, count | Stores counts for each word |
feature_count | feature, count | Stores counts for each feature |
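The generated SQL will roughly correspond to the sketch below, which is derived from the table overview above. Column types, indexes, and character sets are assumptions; use the statements actually produced by createTables.py for the real setup.

```sql
-- Rough sketch of the schema implied by the table overview above.
CREATE DATABASE twitter_bigram;
USE twitter_bigram;

CREATE TABLE LMI_1000 (
  word    VARCHAR(100),
  feature VARCHAR(200),
  sig     DOUBLE,
  count   INT,
  INDEX (word)
);

CREATE TABLE LMI_1000_l200 (
  word1 VARCHAR(100),
  word2 VARCHAR(100),
  count DOUBLE,          -- numeric similarity value
  INDEX (word1)
);

CREATE TABLE word_count (
  word  VARCHAR(100),
  count INT
);

CREATE TABLE feature_count (
  feature VARCHAR(200),
  count   INT
);
```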
Import Results into a MySQL Database
To import the files into the database, you can use the provided script, import_data.sql.
You will need to change the paths to your DT files.
mysql -u root -p < import_data.sql
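If you need to adapt the paths, the statements in import_data.sql will look roughly like the following sketch, which assumes standard LOAD DATA INFILE statements over tab-separated files; the provided script may differ, and the file paths shown are examples only.

```sql
-- Sketch of a tab-separated import; adjust the paths to where you copied the
-- DT files, and compare with the statements in the provided import_data.sql.
USE twitter_bigram;

LOAD DATA LOCAL INFILE '/path/to/dataset__WordCount'
  INTO TABLE word_count
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';

LOAD DATA LOCAL INFILE '/path/to/dataset__FeatureCount'
  INTO TABLE feature_count
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
```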
Now you can access your DT from a database by using the DatabaseThesaurusDatastructure class.
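Independently of that class, you can sanity-check the import with a direct query against the similarity table; the word 'mobile' below is just an example term.

```sql
-- Retrieve the most similar terms for one word, ordered by similarity count.
SELECT word2, count
FROM LMI_1000_l200
WHERE word1 = 'mobile'
ORDER BY count DESC
LIMIT 10;
```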
If you encounter problems with importing or using your DT, you can try Pruning to get it to work.
How do you prune it? It does not fit into memory in R!
Hi Chris,
thank you for asking! The pruning scripts are provided with the source code. I have compiled step-by-step instructions for pruning: DT Pruning
Good luck!
Eugen