Getting Started with JoBimText

This page describes how to use the JoBimText project to compute distributional similarities or contextualized similarities in your own project. In this example we assume that you work on the same computer on which the Hadoop server is running. The components used build on DKPro Core, uimaFIT and OpenNLP.

Holing System: Extract the Features

The files needed for this tutorial can be downloaded from the download section and are contained in the archive jobimtext_pipeline_vXXX.tar.gz. For the feature extraction we also need text files in plain-text format. A corpus of web data and a corpus of sentences extracted from the English Wikipedia are available here and start with the prefix dataset_. We advise splitting them, so UIMA does not have to keep a complete file in memory.
This can be done using the split command from Unix:

split news10M splitted/news10M-part-
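
If the corpus is large, it can also help to control the chunk size explicitly. A possible variant (the chunk size of one million lines is only an assumption and should be adapted to the corpus and the available memory):

# split into chunks of one million lines each; the -l value is an assumption
split -l 1000000 news10M splitted/news10M-part-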

Then the holing system can be started to extract the features. This can be done using the shell script holing_operation.sh from the download section, or by executing the jobimtext.example.holing.HolingHadoop class in the Maven project jobimtext/jobimtext.example from the Subversion repository. The script can be executed as follows

sh holing_operation.sh path pattern output extractor_configuration holing_system_name

and has the following parameters (a concrete example invocation is shown after the list):

  • path: path where the files are located, or a zip archive in the format jar:file:/dir/file.zip!
  • pattern: pattern the files must match, e.g. *.txt for all txt files
  • output: path the extracted features are written to
  • extractor_configuration: file that contains all information needed to configure the output format for keys and features
  • holing_system_name: Ngram[hole_position,ngram] or MaltParser (MaltParser only works when using the source code directly and not the jar, as there are some problems with the MaltParser library)
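
A possible invocation, assuming the split files from above are in the folder splitted/, the output should go to news10M_maltparser_holing, and the extractor configuration shown below is stored as extractor_standard.xml (all names are placeholders):

# extract dependency features with the MaltParser holing system
sh holing_operation.sh splitted/ "news10M-part-*" news10M_maltparser_holing extractor_standard.xml MaltParser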

An example for the extractor_configuration file is shown below:

<jobimtext.holing.extractor.JobimExtractorConfiguration>
  <keyValuesDelimiter>  </keyValuesDelimiter>
  <extractorClassName>
      jobimtext.holing.extractor.TokenExtractors$LemmaPos
  </extractorClassName>
  <attributeDelimiter>#</attributeDelimiter>
  <valueDelimiter>_</valueDelimiter>
  <valueRelationPattern>$relation($values)</valueRelationPattern>
  <holeSymbol>@</holeSymbol>
</jobimtext.holing.extractor.JobimExtractorConfiguration>

The output of the holing system using this configuration file is a list of keys and context features, separated by a tab (as specified by keyValuesDelimiter). The element extractorClassName specifies how an entry is assembled; in this case the lemma and the POS tag of a word are concatenated using a hash sign (#), as defined by attributeDelimiter. The name of the relation and the context features are combined according to the valueRelationPattern pattern. Running the holing system with the MaltParser and the extractor file introduced above on the sentence “I gave the book to the girl” leads to the following result:

I#PRP   -nsubj(@_give#VB)
give#VB nsubj(@_I#PRP)
give#VB prep(@_to#TO)
give#VB dobj(@_book#NN)
give#VB punct(@_.#.)
the#DT  -det(@_book#NN)
book#NN -dobj(@_give#VB)
book#NN det(@_the#DT)
to#TO   pobj(@_girl#NN)
to#TO   -prep(@_give#VB)
the#DT  -det(@_girl#NN)
girl#NN det(@_the#DT)
girl#NN -pobj(@_to#TO)
.#.     -punct(@_give#VB)

One can observe that the tokens are lemmatized and that the POS tags are concatenated to the lemma of the token using the hash sign.

Testing Holing extractors

To test different holing operations, you can use the StartHolingOperation.java class from the org.jobimtext.examples.oss project.
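
One possible way to run the class, assuming the project has been checked out as a Maven project and the fully qualified class name is org.jobimtext.examples.oss.StartHolingOperation (both are assumptions and may differ in your checkout), is via the exec-maven-plugin:

# run the interactive holing-operation tester (class name is an assumption)
mvn compile exec:java -Dexec.mainClass=org.jobimtext.examples.oss.StartHolingOperation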

This will list the available holing operations, and you can select one for execution.

available example Holing Operations: 
0    JoTrigramsHoling.xml
1    MaltParserDependencyHoling_de.xml
2    MaltParserDependencyHoling_Collapsed_Lemma.xml
3    JoBigramsHoling.xml
4    JoNgramsHoling.xml
5    FrameHoling.xml
6    MaltParserDependencyHoling_Collapsed.xml
7    MaltParserDependencyHoling.xml

It will display the output folder where the holing operation results are stored.

Calculate the Distributional Similarities

Afterwards the holing output file should be split again and then transferred to the distributed file system (HDFS) of the MapReduce server:

split -a 5 -d news10M_hadoop_input splitted/news10M_maltdependency_part-
hadoop dfs -copyFromLocal splitted news10M_maltdependency

The execution pipeline for the MapReduce jobs can be generated using the script generateHadoopScript.py with the following parameters:

generateHadoopScript.py dataset wc s t p significance simsort_count [computer file_prefix]

with

  • dataset: e.g. news10M_ngram_1_3, news10M_maltparser
  • wc: maximal number of unique words a feature is allowed to have
  • s: minimal threshold for the significance score between word and feature
  • t: minimal threshold for the word-feature count
  • p: number of features used per word
  • significance: significance measure, one of LMI, PMI, LL
  • simsort_count: number of similar terms sorted in the last step
  • computer: computer the files are copied to
  • file_prefix: prefix (folder) the files (distributional similarity, WordCount, FeatureCount) will be copied to

For example, the command

python generateHadoopScript.py news10M_maltdependency 1000 0 0 1000 LL 200 desktop_computer dt/

will produce an output script named news10M_maltdependency_s0_t0_p1000_LL_simsort200 with the following content:

#hadoop dfs -rmr context_out    news10M_maltdependency__WordFeatureCount
#hadoop dfs -rmr wordcount_out  news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount
#hadoop dfs -rmr featurecount_out       news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount
#hadoop dfs -rmr freqsig_out    news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0
#hadoop dfs -rmr context_filter_out     news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount
#hadoop dfs -rmr prunegraph_out news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000
#hadoop dfs -rmr aggregate_out  news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt
#hadoop dfs -rmr simcount_out   news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures
#hadoop dfs -rmr simsort_out    news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.WordFeatureCount news10M_maltdependency news10M_maltdependency__WordFeatureCount True
pig  -param contextout=news10M_maltdependency__WordFeatureCount -param out=news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount -param wc=1000 pig/PruneFeaturesPerWord.pig
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.FeatureCount news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount True
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.WordCount news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount True
pig  -param s=0 -param t=0 -param wordcountout=news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount -param featurecountout=news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount -param contextout=news10M_maltdependency__PruneFeaturesPerWord_1000__WordFeatureCount -param freqsigout=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0 pig/FreqSigLL.pig
pig  -param p=1000 -param freqsigout=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0 -param prunegraphout=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000 pig/PruneGraph.pig
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.AggrPerFt news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000 news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt True
hadoop jar lib/thesaurus.distributional.hadoop-0.0.6.jar jobimtext.thesaurus.distributional.hadoop.mapreduce.SimCounts1WithFeatures news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures True
pig  -param limit=200 -param IN=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures -param OUT=news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200 pig/SimSort.pig
ssh desktop_computer  'mkdir -p dt '
hadoop dfs -text  news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount/p* | ssh desktop_computer  'cat ->  dt/news10M_maltdependency__PruneFeaturesPerWord_1000__WordCount '
hadoop dfs -text  news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0/p* | ssh desktop_computer  'cat ->  dt/news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0 '
hadoop dfs -text  news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount/p* | ssh desktop_computer  'cat ->  dt/news10M_maltdependency__PruneFeaturesPerWord_1000__FeatureCount '
hadoop dfs -text  news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200/p* | ssh desktop_computer  'cat ->  dt/news10M_maltdependency__PruneFeaturesPerWord_1000__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200 '

The first lines of the script are commented out and can be used to delete previous output from the Hadoop file system. After executing the script we have to wait until the Hadoop jobs have finished. The files are then copied to the specified computer into the folder given by the prefix.
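
The generated file is itself a shell script and can be run directly; keeping a log of the Hadoop job output is optional but often useful (the tee redirection is merely a suggestion):

# run the generated pipeline and keep a log of the job output
sh news10M_maltdependency_s0_t0_p1000_LL_simsort200 2>&1 | tee news10M_maltdependency_pipeline.log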

Add the data to a database

Currently we support two databases: MySQL and DCA, a memory-based data server provided within this project. Here we only describe the DCA server. The configuration files for the DCA are generated using the create_db_dca.sh script (a concrete example call is shown after the parameter list below):

sh create_db_dca.sh folder prefix database_server

folder: folder where all files from the Hadoop step are located
prefix: prefix of the files (e.g. news10M_maltparser)
database_server: name of the server on which the database runs
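
For example, assuming the Hadoop output was copied to /home/user/data/out/dt/ with the prefix news10M_maltparser and the DCA server will run on localhost (all three values are placeholders):

# generates news10M_maltparser_dcaserver and news10M_maltparser_dcaserver_tables.xml
sh create_db_dca.sh /home/user/data/out/dt/ news10M_maltparser localhost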

This command creates two files, PREFIX_dcaserver and PREFIX_dcaserver_tables.xml. The latter, which tells the Java software which tables to use, has the following content:

<jobimtext.util.db.conf.DatabaseTableConfiguration>
  <tableOrder2>subset_wikipedia-maltparser__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200</tableOrder2>
  <tableOrder1>subset_wikipedia-maltparser__FreqSigLL_s_0_t_0</tableOrder1>
  <tableValues>subset_wikipedia-maltparser__FeatureCount</tableValues>
  <tableKey>subset_wikipedia-maltparser__WordCount</tableKey>
</jobimtext.util.db.conf.DatabaseTableConfiguration>

and the PREFIX_dcaserver file, which configures the tables served by the DCA server:

# TableID       ValType TCPP#   TableLines      CacheSize       MaxValues               DataAllocation          InputFileNames/Dir      FileFilter
subset_wikipedia-maltparser__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200  TABLE   8080    0       10000   10000   server[0-19228967]      /home/user/data/out/dt/subset_wikipedia-maltparser__FreqSigLL_s_0_t_0__PruneGraph_p_1000__AggrPerFt__SimCounts1WithFeatures__SimSortlimit_200   NONE
subset_wikipedia-maltparser__FreqSigLL_s_0_t_0  TABLE   8081    0       10000   10000   server[0-19228967]      /home/user/data/out/dt/subset_wikipedia-maltparser__FreqSigLL_s_0_t_0   NONE
subset_wikipedia-maltparser__FeatureCount       SCORE   8082    0       10000   10000   server[0-19228967]      /home/user/data/out/dt/subset_wikipedia-maltparser__FeatureCount        NONE
subset_wikipedia-maltparser__WordCount  SCORE   8083    0       10000   10000   server[0-19228967]      /home/user/data/out/dt/subset_wikipedia-maltparser__WordCount   NONE

Further details on the DCA server are given in the README file within the DCA project in the Subversion repository. The server can then be started with the PREFIX_dcaserver configuration file using the following command:

java -Xmx... -Xms.... -cp $(echo lib/*jar| tr ' ' ':') com.ibm.sai.dca.server.Server PREFIX_dcaserver
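
A concrete call, assuming a heap of 4 GB is sufficient and the prefix news10M_maltparser was used (both values are assumptions), could look like this:

# start the DCA server with 4 GB heap (heap size is an assumption)
java -Xmx4g -Xms4g -cp $(echo lib/*jar| tr ' ' ':') com.ibm.sai.dca.server.Server news10M_maltparser_dcaserver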

If the computer and the folder prefix were already specified for the generateHadoopScript.py call, the data needed here is available locally.

Get expansions for new text

When all the data is loaded into the database, we can use the script apply_dt_ct.sh to get expansions of words for new documents.

The script is invoked as follows:

sh apply_dt_ct.sh path pattern holing_system_name extractor_configuration database_configuration database_tables

  • path: path of the files (zip files can also be used, e.g. jar:file:/dir/file.zip!)
  • pattern: pattern the files that should be expanded must match (e.g. *.txt for all txt files)
  • extractor_configuration: file that contains all information needed for the output format for keys and features
  • holing_system_name: Ngram[hole_position,ngram] or MaltParser (standard)
  • database_configuration: configuration file needed for the DCA server
  • database_tables: configuration file for the Java software, specifying the table names
  • targetword: if true, the target word has to be encapsulated using <target>word</target>; otherwise every word will be expanded (default value: true)

The input format of the files can be plain text when all words are to be expanded; in that case the parameter targetword should be set to false. When only selected words are to be expanded, they should be encapsulated as <target>word</target>.
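
A minimal test input with a single target word, assuming targetword is left at its default value of true, can be created like this (the file name test.txt matches the example in the next section):

# only the word between the <target> tags will be expanded
echo 'I gave the <target>book</target> to the girl.' > test.txt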

For the impatient ones

Here we show an example that executes all steps on a single system (including the Hadoop server), using the MaltParser. The number of lines into which the files are split should probably be adjusted to the dataset used.

Calculate the distributional thesaurus

FILEDIR=/home/user/data
FILE=textfile
OUTPUT=/home/user/data/out

DB_SERVER=server
EXTRACTOR=extractor_standard.xml
HOLINGSYSTEM=MaltParser
HOLINGNAME=maltparser

#Holing Operation
mkdir -p $OUTPUT
mkdir -p $OUTPUT/splitted/
split $FILEDIR/$FILE $OUTPUT/splitted/$FILE
sh holing_operation.sh $OUTPUT/splitted $FILE* $OUTPUT/$FILE-$HOLINGNAME $EXTRACTOR $HOLINGSYSTEM
mkdir $OUTPUT/$FILE-$HOLINGNAME-splitted/

#Compute distributional similarity
split -a 5 -l 2500000 -d $OUTPUT/$FILE-$HOLINGNAME $OUTPUT/$FILE-$HOLINGNAME-splitted/part-
hadoop dfs -copyFromLocal $OUTPUT/$FILE-$HOLINGNAME-splitted $FILE-$HOLINGNAME
mkdir $OUTPUT/dt/
python generateHadoopScript.py $FILE-$HOLINGNAME 1000 0 0 1000 LL 200 localhost $OUTPUT/dt/
sh $FILE-$HOLINGNAME"_s0_t0_p1000_LL_simsort200"

Start the database server

#Load and start databaseserver
sh create_db_dca.sh $OUTPUT/dt/ $FILE $DB_SERVER
java -Xmx3g -cp $(echo lib/*jar| tr ' ' ':') com.ibm.sai.dca.server.Server $FILE"_dcaserver"

Expand the text in a given text file

APPLY_FOLDER=./
APPLY_FILE=test.txt

#start dt and ct on file
sh apply_dt_ct.sh $APPLY_FOLDER $APPLY_FILE $EXTRACTOR $HOLINGSYSTEM $FILE"_dcaserver" $FILE"_dcaserver_tables.xml"
