DRUID is an unsupervised method for ranking words according to their multiwordness. The method is based on JoBimText and is an implementation of the method described in [1].
Precomputed Models
We provide some models computed from different corpora which can be used directly:
Language | Corpus | Processing | filter | download |
---|---|---|---|---|
English | Wikipedia | Token+POS, length 1-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers; Katz POS filter | link |
English | Wikipedia | Token+POS, length 1-4, POS Trigram | stopwords: 100 most frequent words; filter stopsymbols; filter numbers; Katz POS filter | link |
English | Wikipedia | Token+POS, length 1-4, POS Trigram | filter stopsymbols; filter numbers; Katz POS filter | link |
Swedish | Spraakbanken | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link |
Dutch | Corpus from the Web | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link |
German | Newspaper | Token, length 2-4, Trigram | stopwords: 50 most frequent words; filter stopsymbols; filter numbers | link |
Prerequisites
The computation requires a Hadoop cluster, our software package, and a corpus on which you want to compute the MWEs. We expect the corpus to be preprocessed so that there is one sentence per line. Currently, we support tokenization for English and German. The corpus needs to be loaded into the HDFS file system. If you have a folder containing the corpus files, you can load them into HDFS using, e.g., the following command:
hadoop fs -copyFromLocal corpus corpus_hdfs
Be aware that corpus_hdfs is a folder containing files, not a file itself. A list of the HDFS shell commands can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
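If you want to double-check the corpus format and the upload, something along the following lines can help (the file name part-0001.txt is only a placeholder for one of your corpus files):

# show two lines of one corpus file; there should be exactly one sentence per line
head -n 2 corpus/part-0001.txt
# list the uploaded folder on HDFS to verify it arrived as a directory of files
hadoop fs -ls corpus_hdfs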
Computing DRUID Rankings
To compute a distributional thesaurus (DT) and multiword expressions (MWE) the following steps are required:
- log on to the Hadoop cluster
- download our software package (version 0.1.3 or higher):
wget http://ltmaggie.informatik.uni-hamburg.de/jobimtext/jobimtext_pipeline_0.1.3.tar.gz
- unpack the software:
tar xfvz jobimtext_pipeline_0.1.3.tar.gz
- change to the folder
cd jobimtext_pipeline_0.1.3
- create a shell script with the pipeline to compute a DT and the MWEs. In this example we assume that the corpus folder on HDFS is named corpus_hdfs. We provide two preprocessing options: 1) no linguistic information except tokenization (-hl mwe_trigram), or 2) tokenization and POS tagging (-hl mwe_trigram_pos). Furthermore, we provide several parameters for filtering word sequences, as described in the following table:
Parameter | Description |
---|---|
-mwe-stopword-filter | Removes MWEs that start or end with a stopword. Using this option, the 100 most frequent words of the corpus are considered as stopwords. |
-mwe-stopword-filter-top N | Same as -mwe-stopword-filter, but the number of most frequent words can be specified by N. |
-mwe-stopword-filter-file <FILE> | Uses the specified file <FILE> as stopword list. This file needs to be uploaded to HDFS. |
-mwe-pos-filter | Filters word sequences according to their POS sequence (the first letter of each POS tag is considered), using the filter defined by Katz (([JN]+\|[JN]*[NP]?[JN]*)), which extracts noun compounds. |
-mwe-pos-filter-regex | Filters word sequences according to a regular expression based on the POS sequence. |
-mwe-filter-len | Specifies the minimum number of words a word sequence needs to have [default: 2]. |
-mwe-filter-numbers | Filters all word sequences containing numbers. |
-mwe-filter-stopsymbols | Filters word sequences containing a word that starts with a stopsymbol (!.;,)(][}{). |

python generateHadoopScript.py -mwe -hl mwe_trigram -mwe-stopword-filter -mwe-stopword-filter-top 50 -nb corpus_hdfs
- the command creates a shell script which you can execute
sh corpus_hdfs_mwe_trigram_s0.0_f2_w2_wf0_wpfmax1000_wpfmin2_p1000_sc_one_LMI_simsort_ms_2_l200.sh
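If the corpus has been preprocessed with POS tags (-hl mwe_trigram_pos), the POS-based filters from the table above can be used instead. The following call is only a sketch combining those documented options; it is not the exact command used for the precomputed models:

python generateHadoopScript.py -mwe -hl mwe_trigram_pos -mwe-pos-filter -mwe-filter-numbers -mwe-filter-stopsymbols -nb corpus_hdfs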
Extracting Multiword Expressions
After the computation is finished you can have a look at the n-grams, which are already sorted according to their DRUID score. Depending on the parameters used, these lists are already filtered. The list can be retrieved from HDFS and stored locally as follows:
hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2__druid_filtered_sw_wc_50_pos_none_ml_2_stopsym_F_num_F_sorted/p* > druid_ranked_words
less druid_ranked_words
The score in the second column is the uniqueness score.
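If only high-scoring candidates are of interest, a simple cutoff can be applied to the downloaded list. The sketch below assumes tab-separated columns and uses an arbitrary threshold of 0.5:

# keep only entries whose DRUID score (second column) is above 0.5
awk -F'\t' '$2 > 0.5' druid_ranked_words > druid_high_score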
Offline Filtering
In addition to the filtering options available on Hadoop, we demonstrate how to filter out single-word terms (which could be used as keyphrases in information retrieval) with the following command:
cat druid_ranked_words | grep -v " " > druid_ranked_mwe less druid_ranked_mwe
To clean the ranked list, stopwords/stopsymbols (e.g. .,/?'”) that appear at the beginning or end of a word sequence should also be filtered. Symbols and numbers can simply be filtered with the following command:
cat druid_ranked_words | grep -v $'^[,.0-9()\']' > druid_ranked_words_filtered
less druid_ranked_words_filtered
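For the stopword filtering at the beginning and end of a word sequence mentioned above, a small awk sketch like the following could be used. It assumes a one-word-per-line stopword file (stopwords.txt, a hypothetical file you provide) and that the word sequence is in the first tab-separated column:

# drop entries whose first or last word occurs in stopwords.txt
awk -F'\t' 'NR==FNR {sw[$1]=1; next} {n=split($1, w, " ")} !(w[1] in sw) && !(w[n] in sw)' stopwords.txt druid_ranked_words_filtered > druid_ranked_words_clean
less druid_ranked_words_clean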
An additional filtering according to POS tags could further clean the resulting MWEs.
In addition to the n-gram ranking there is also a DT which contains all n-grams. It can be viewed with the following command:
hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2/p* |less
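To inspect the similar terms of a single entry, the DT output can also be filtered directly. This is only a sketch; it assumes the query term is stored in the first tab-separated column of the DT lines:

hadoop fs -text corpus_hdfs_mwe_trigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_True__SimSortWithFeatureslimit_200_minsim_2/p* | grep -P '^red blood cells\t' | less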
Using DRUID for Extracting Keyphrases
As already mentioned in the previous section, DRUID is not only able to rank MWEs but also single-word units. These could be useful as keywords/keyphrases in information retrieval. For this, the parameter -mwe-filter-len should be set to 1. Using the stopsymbol/number filter as described above results in the following words, based on a Wikipedia corpus:
Baranov forty-fourth Conroy josephinae U-26 U-78 Jamieson U-31 U-4 fallax U-3 Mackie McMullen
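A corresponding script invocation that keeps single-word units could look like the following; this is only a sketch combining the documented options and not the exact call that produced the list above:

python generateHadoopScript.py -mwe -hl mwe_trigram -mwe-filter-len 1 -mwe-filter-numbers -mwe-filter-stopsymbols -nb corpus_hdfs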
Extending the Maximal Length of MWE
Currently the computation is performed for n-grams with n=1 up to 4. If you want to change this, edit the file jobimtext_pipeline_0.1.3/descriptors/holing/MWE_Trigram_Holing.xml. Within this file you will find the entry:
<nameValuePair>
  <name>length</name>
  <value>
    <integer>4</integer>
  </value>
</nameValuePair>
The number within the “integer” tag specifies the maximum length of the n-grams.
Demonstration of MWEs
We also provide a web demo containing a DT with MWEs, which is available here. To see MWEs, click on the arrow on the right side of the Parse button, select “Medline Trigram (MWE)”, and insert a sentence like ‘red blood cells are by far the most abundant cells in the blood’. The system will mark MWEs in the graph if they are found in the ranked MWE list (using some threshold).
Publication:
DRUID is based on the following publication: