DRUID

DRUID is an unsupervised method for ranking words according to their multiwordness. It is based on JoBimText and implements the approach described in [1].

Precomputed Models

We provide some models computed from different corpora which can be used directly:

Language | Corpus | Processing | Filter | Download
English | Wikipedia | Token+POS, length 1-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers; Katz POS filter | link
English | Wikipedia | Token+POS, length 1-4, POS Trigram | stopwords: 100 most frequent words; filter stopsymbols; filter numbers; Katz POS filter | link
English | Wikipedia | Token+POS, length 1-4, POS Trigram | filter stopsymbols; filter numbers; Katz POS filter | link
Swedish | Spraakbanken | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link
Dutch | Corpus from the Web | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link
German | Newspaper | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link


Prerequisites

The computation requires a Hadoop cluster, our software package, and a corpus on which the MWEs will be computed. We expect the corpus to be preprocessed so that there is one sentence per line. Currently, we support tokenization for English and German. The corpus needs to be loaded into the HDFS file system. If you have a folder containing the corpus files, you can load it into HDFS with, e.g., the following command:

hadoop fs -copyFromLocal corpus corpus_hdfs

Be aware that corpus_hdfs is a folder containing files and not a file itself. A list of the HDFS shell commands can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
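To verify the corpus format and the upload, commands like the following can be used (part-0001.txt is only a placeholder for one of your corpus files):

head -n 2 corpus/part-0001.txt   # every line should contain exactly one sentence
hadoop fs -ls corpus_hdfs        # check that the corpus files arrived on HDFS
hadoop fs -du -h corpus_hdfs     # optionally, check the size of the uploaded corpus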

Computing DRUID Rankings

To compute a distributional thesaurus (DT) and multiword expressions (MWE) the following steps are required:

  1. log on to the Hadoop cluster
  2. download our software package (version needs to be 0.1.3 or higher):
    wget http://ltmaggie.informatik.uni-hamburg.de/jobimtext/jobimtext_pipeline_0.1.3.tar.gz
  3. unpack the software:
    tar xfvz jobimtext_pipeline_0.1.3.tar.gz
  4. change to the folder
    cd jobimtext_pipeline_0.1.3
  5. create a shell script with the pipeline to compute a DT and the MWEs. In the example we expect the corpus folder on HDFS to be named corpus_hdfs. We provide two preprocessings: 1) no linguistic information except tokenization (-hl mwe_trigram) or 2) tokenization and POS tagging (-hl mwe_trigram_pos). Furthermore, we provide several parameters for filtering word sequences, as described in the following table (a second example command using the POS pipeline is shown after this list):
    Parameter | Description
    -mwe-stopword-filter | Removes MWEs that start or end with a stopword. With this option the 100 most frequent words of the corpus are considered stopwords.
    -mwe-stopword-filter -mwe-stopword-filter-top N | Same as -mwe-stopword-filter, but the number of most frequent words used as stopwords is specified by N.
    -mwe-stopword-filter -mwe-stopword-filter-file <FILE> | Uses the specified file <FILE> as stopword list. This file needs to be uploaded to HDFS.
    -mwe-pos-filter | Filters word sequences according to their POS sequence (only the first letter of each POS tag is considered), using the noun compound filter defined by Justeson and Katz (([JN]+|[JN]*[NP]?[JN]*)).
    -mwe-pos-filter -mwe-pos-filter-regex | Filters word sequences according to a regular expression over the POS sequence.
    -mwe-filter-len | Specifies the minimum number of words a word sequence needs to have [default: 2].
    -mwe-filter-numbers | Filters all word sequences containing numbers.
    -mwe-filter-stopsymbols | Filters word sequences containing a word that starts with a stopsymbol (!.;,)(][}{).
    python generateHadoopScript.py -mwe -hl mwe_trigram -mwe-stopword-filter -mwe-stopword-filter-top 50 -nb corpus_hdfs
  6. the command creates a shell script which you can execute
    sh corpus_hdfs_mwe_trigram_s0.0_f2_w2_wf0_wpfmax1000_wpfmin2_p1000_sc_one_LMI_simsort_ms_2_l200.sh
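A second, illustrative call (one possible combination of the options described in the table above, not the only valid one) uses the POS pipeline together with the POS, stopsymbol and number filters:

python generateHadoopScript.py -mwe -hl mwe_trigram_pos -mwe-pos-filter -mwe-filter-stopsymbols -mwe-filter-numbers -nb corpus_hdfs

This again generates a shell script (with a name analogous to the one above) that has to be executed on the cluster.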


Extracting Multiword Expressions

After the computation is finished, you can have a look at the n-grams, which are already sorted according to their DRUID score. Depending on the parameters used, these lists are already filtered. The list can be copied from HDFS and viewed as follows:

hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2__druid_filtered_sw_wc_50_pos_none_ml_2_stopsym_F_num_F_sorted/p* > druid_ranked_words
less druid_ranked_words

The score in the second column is the uniqueness score.
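If only highly ranked candidates are of interest, the list can be thresholded on this score, e.g. with awk. This assumes a tab-separated output with the score in the second column; the threshold 0.5 is only an example value and should be adapted to your corpus:

awk -F'\t' '$2 >= 0.5' druid_ranked_words > druid_ranked_words_high
less druid_ranked_words_high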

Offline Filtering

In addition to the filtering options available on Hadoop, we demonstrate how to filter out single-word terms (which by themselves could be used as keyphrases in information retrieval; see the keyphrase section below), keeping only multiword candidates, using the following command:

cat druid_ranked_words | grep " " > druid_ranked_mwe
less druid_ranked_mwe

To clean the ranked list further, stopwords and stopsymbols (e.g. .,/?'”) that appear at the beginning or end of a word sequence should also be filtered. Symbols and numbers can simply be filtered with the following command:

cat druid_ranked_words | grep -v $'^[,.0-9()\']' > druid_ranked_words_filtered
less druid_ranked_words_filtered
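Both steps can also be combined into a single pipeline (druid_ranked_mwe_filtered is just an arbitrary output name):

grep " " druid_ranked_words | grep -v $'^[,.0-9()\']' > druid_ranked_mwe_filtered
less druid_ranked_mwe_filtered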

Additional filtering according to POS tags could further clean the resulting MWEs.
In addition to the n-gram ranking, there is also a DT that contains all n-grams. It can be viewed with the following command:

hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2/p* |less
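To inspect the similar terms of a single n-gram instead of paging through the whole DT, the output can be piped through grep (here 'red blood cells' is only an example query; the command assumes that each line starts with the n-gram):

hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2/p* | grep "^red blood cells" | less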

Using DRUID for Extracting Keyphrases

As already mentioned in the previous section, DRUID can rank not only MWEs but also single-word units. These could be useful as keywords/keyphrases, as used in information retrieval. For this, the parameter -mwe-filter-len should be set to 1 (an example call is shown after the word list below). Using the stopsymbol/number filter as described above results in the following words based on a Wikipedia corpus:

Baranov
forty-fourth
Conroy
josephinae
U-26
U-78
Jamieson
U-31
U-4
fallax
U-3
Mackie
McMullen

Extending the Maximal Length of MWE

Currently the computation is performed for n-grams with n=1 up to 4. If you want to change this, edit the file jobimtext_pipeline_0.1.3/descriptors/holing/MWE_Trigram_Holing.xml. Within this file you will find the entry:

<nameValuePair>
   <name>length</name>
   <value>
      <integer>4</integer>
   </value>
</nameValuePair>

The number within the “integer” tag specifies the maximum length of the n-grams.
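To quickly locate this entry in the descriptor, a command such as the following can be used from within the jobimtext_pipeline_0.1.3 folder:

grep -n -A 3 "<name>length</name>" descriptors/holing/MWE_Trigram_Holing.xml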

Demonstration of MWEs

We also provide a Web demo containing a DT with MWEs, which is available here. To see MWEs, click on the arrow on the right side of the Parse button, select “Medline Trigram (MWE)”, and insert a sentence like ‘red blood cells are by far the most abundant cells in the blood’. The system will mark MWEs in the graph if they are found in the ranked MWE list (using some threshold).

Publication:

DRUID is based on the following publication:

[1] Martin Riedl, Chris Biemann (2015): A Single Word is not Enough: Ranking Multiword Expressions Using Distributional Semantics. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal.