DRUID is an unsupervised method for ranking words according to their multiwordness. The method is based on JoBimText and is an implementation of the method described in [1].
Precomputed Models
We provide some models computed from different corpora which can be used directly:
Language | Corpus | Processing | filter | download |
---|---|---|---|---|
English | Wikipedia | Token+POS, length 1-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers; Katz POS filter | link |
English | Wikipedia | Token+POS, length 1-4, POS Trigram | stopwords: 100 most frequent words; filter stopsymbols; filter numbers; Katz POS filter | link |
English | Wikipedia | Token+POS, length 1-4, POS Trigram | filter stopsymbols; filter numbers; Katz POS filter | link |
Swedish | Spraakbanken | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link |
Dutch | Corpus from the Web | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link |
German | Newspaper | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link |
Prerequisites
The computation requires a Hadoop cluster, our software package, and a corpus on which you want to compute the MWEs. We expect the corpus to be preprocessed so that there is one sentence per line. Currently, we support tokenization for English and German. The corpus needs to be loaded into the HDFS file system. If you have a folder containing the corpus files, you can load them into HDFS using, e.g., the following command:
hadoop fs -copyFromLocal corpus corpus_hdfs
Be aware that corpus_hdfs is a folder containing files, not a file itself. A list of the HDFS shell commands can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
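If you want to double-check the corpus format and the upload, something along the following lines can help (the file name part-0001.txt is only a placeholder for one of your corpus files):

# show two lines of one corpus file; there should be exactly one sentence per line
head -n 2 corpus/part-0001.txt
# list the uploaded folder on HDFS to verify it arrived as a directory of files
hadoop fs -ls corpus_hdfs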
Computing DRUID Rankings
To compute a distributional thesaurus (DT) and multiword expressions (MWE) the following steps are required:
- log on to the Hadoop cluster
- download our software package (version 0.1.3 or higher):
wget http://ltmaggie.informatik.uni-hamburg.de/jobimtext/jobimtext_pipeline_0.1.3.tar.gz
- unpack the software:
tar xfvz jobimtext_pipeline_0.1.3.tar.gz
- change to the folder
cd jobimtext_pipeline_0.1.3
- create a shell script with the pipeline to compute a DT and the MWEs. In this example we assume that the corpus folder on HDFS is named corpus_hdfs. We provide two preprocessing options: 1) no linguistic information except tokenization (-hl mwe_trigram), or 2) tokenization and POS tagging (-hl mwe_trigram_pos). Furthermore, we provide several parameters for filtering word sequences, as described in the following table:
Parameter | Description |
---|---|
-mwe-stopword-filter | Removes MWEs that start or end with a stopword. Using this option, the 100 most frequent words of the corpus are considered as stopwords. |
-mwe-stopword-filter-top N | Same as -mwe-stopword-filter, but the number of most frequent words can be specified by N. |
-mwe-stopword-filter-file <FILE> | Uses the specified file <FILE> as stopword list. This file needs to be uploaded to HDFS. |
-mwe-pos-filter | Filters word sequences according to their POS sequence (the first letter of each POS tag is considered), using the filter defined by Katz (([JN]+\|[JN]*[NP]?[JN]*)), which extracts noun compounds. |
-mwe-pos-filter-regex | Filters word sequences according to a regular expression based on the POS sequence. |
-mwe-filter-len | Specifies the minimum number of words a word sequence needs to have [default: 2]. |
-mwe-filter-numbers | Filters all word sequences containing numbers. |
-mwe-filter-stopsymbols | Filters word sequences containing a word that starts with a stopsymbol (!.;,)(][}{). |

python generateHadoopScript.py -mwe -hl mwe_trigram -mwe-stopword-filter -mwe-stopword-filter-top 50 -nb corpus_hdfs
- the command creates a shell script which you can execute
sh corpus_hdfs_mwe_trigram_s0.0_f2_w2_wf0_wpfmax1000_wpfmin2_p1000_sc_one_LMI_simsort_ms_2_l200.sh
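If the corpus has been preprocessed with POS tags (-hl mwe_trigram_pos), the POS-based filters from the table above can be used instead. The following call is only a sketch combining those documented options; it is not the exact command used for the precomputed models:

python generateHadoopScript.py -mwe -hl mwe_trigram_pos -mwe-pos-filter -mwe-filter-numbers -mwe-filter-stopsymbols -nb corpus_hdfs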
Extracting Multiword Expressions
After the computation is finished you can have a look at the n-grams, which are already sorted according to their DRUID score. Depending on the parameters used, these lists are already filtered. The list can be retrieved from HDFS and stored locally as follows:
hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2__druid_filtered_sw_wc_50_pos_none_ml_2_stopsym_F_num_F_sorted/p* > druid_ranked_words
less druid_ranked_words
The score in the second column is the uniqueness score.
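If only high-scoring candidates are of interest, a simple cutoff can be applied to the downloaded list. The sketch below assumes tab-separated columns and uses an arbitrary threshold of 0.5:

# keep only entries whose DRUID score (second column) is above 0.5
awk -F'\t' '$2 > 0.5' druid_ranked_words > druid_high_score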
Offline Filtering
In addition to the filtering options available on Hadoop, we demonstrate how to filter out single-word terms (which could be used as keyphrases in information retrieval) with the following command:
cat druid_ranked_words | grep -v " " > druid_ranked_mwe less druid_ranked_mwe
To clean the ranked list, stopwords/stopsymbols (e.g. .,/?'”) that appear at the beginning or end of a word sequence should also be filtered. Symbols and numbers can simply be filtered with the following command:
cat druid_ranked_words | grep -v $'^[,.0-9()\']' > druid_ranked_words_filtered
less druid_ranked_words_filtered
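For the stopword filtering at the beginning and end of a word sequence mentioned above, a small awk sketch like the following could be used. It assumes a one-word-per-line stopword file (stopwords.txt, a hypothetical file you provide) and that the word sequence is in the first tab-separated column:

# drop entries whose first or last word occurs in stopwords.txt
awk -F'\t' 'NR==FNR {sw[$1]=1; next} {n=split($1, w, " ")} !(w[1] in sw) && !(w[n] in sw)' stopwords.txt druid_ranked_words_filtered > druid_ranked_words_clean
less druid_ranked_words_clean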
An additional filtering according to POS tags could further clean the resulting MWEs.
In addition to the n-gram ranking there is also a DT which contains all n-grams. It can be viewed with the following command:
hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2/p* |less
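To inspect the similar terms of a single entry, the DT output can also be filtered directly. This is only a sketch; it assumes the query term is stored in the first tab-separated column of the DT lines:

hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2/p* | grep -P '^red blood cells\t' | less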
Using DRUID for Extracting Keyphrases
As already mentioned in the previous section, DRUID is not only able to rank MWEs but also single-word units. These could be useful as keywords/keyphrases in information retrieval. For this, the parameter -mwe-filter-len should be set to 1. Using the stopsymbol/number filter as described above results in the following words, based on a Wikipedia corpus:
Baranov forty-fourth Conroy josephinae U-26 U-78 Jamieson U-31 U-4 fallax U-3 Mackie McMullen
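A corresponding script invocation that keeps single-word units could look like the following; this is only a sketch combining the documented options and not the exact call that produced the list above:

python generateHadoopScript.py -mwe -hl mwe_trigram -mwe-filter-len 1 -mwe-filter-numbers -mwe-filter-stopsymbols -nb corpus_hdfs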
Extending the Maximal Length of MWE
Currently the computation is performed for n-grams with n=1 up to 4. If you want to change this, edit the file jobimtext_pipeline_0.1.3/descriptors/holing/MWE_Trigram_Holing.xml. Within this file you will find the entry:
<nameValuePair>
  <name>length</name>
  <value>
    <integer>4</integer>
  </value>
</nameValuePair>
The number within the “integer” tag specifies the maximum length of the n-grams.
Demonstration of MWEs
We also provide a web demo containing a DT with MWEs, which is available here. To see MWEs, click on the arrow on the right side of the Parse button, select “Medline Trigram (MWE)”, and insert a sentence like ‘red blood cells are by far the most abundant cells in the blood’. The system will mark MWEs in the graph if they are found in the ranked MWE list (using some threshold).
Publication:
DRUID is based on the following publication: