Dependency Collapsing

This component performs dependency collapsing and propagation on dependency parses. It offers extensible rulesets for German and English that can be adapted to any parser. This work is described in:
Ruppert, E., Klesy, J., Riedl, M., and Biemann, C. (2015): Rule-based Dependency Parse Collapsing and Propagation for German and English. In Proc. GSCL 2015, Essen, Germany.

Input formats

As POS taggers and parsers are included in the packages, it is sufficient to provide plain text files (*.txt). This format is used by default.

They sit in the car.

The collapsing package also accepts the CoNLL format, in the 2-column (tokenized), 6-column (POS-tagged), or 10-column (parsed) variant:

tokenized:
1    They
2    sit
3    in
4    the
5    car
6    .

POS tagged:
1    They    _    PRP    PRP    _
2    sit    _    VBP    VBP    _
3    in    _    IN    IN    _
4    the    _    DT    DT    _
5    car    _    NN    NN    _
6    .    _    .    .    _

parsed:
1    They    _    PRP    PRP    _    2    nsubj    _    _
2    sit    _    VBP    VBP    _    0    ROOT    _    _
3    in    _    IN    IN    _    2    prep    _    _
4    the    _    DT    DT    _    5    det    _    _
5    car    _    NN    NN    _    3    pobj    _    _
6    .    _    .    .    _    2    punct    _    _
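The three CoNLL variants differ only in the number of columns per token line. A minimal Python sketch for reading such files, assuming columns are separated by whitespace and sentences by blank lines (the helper names `read_conll` and `column_count` are hypothetical and not part of the collapsing package):

```python
def read_conll(lines):
    """Group token rows into sentences; each row is a list of column values."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split())
    if current:
        sentences.append(current)
    return sentences


def column_count(sentences):
    """Return the column count shared by all rows, or raise if rows disagree."""
    counts = {len(row) for sentence in sentences for row in sentence}
    if len(counts) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(counts)}")
    return counts.pop()


parsed = [
    "1    They    _    PRP    PRP    _    2    nsubj    _    _",
    "2    sit    _    VBP    VBP    _    0    ROOT    _    _",
]
print(column_count(read_conll(parsed)))  # 10
```

A count of 2, 6, or 10 then corresponds to the tokenized, POS-tagged, or parsed variant shown above.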

Please note that POS tagging and parsing are enabled by default! If you do not want to redo the tagging or parsing, use the parameters -nt (no tagging) or -np (no parsing).

Output formats:

CoNLL

By default, the collapsing parser outputs CoNLL format:

T_ID Word [Lemma] CPOS   POS    __   HEAD DEPREL   __  __

e.g.:
1    They    _    PRP    PRP    _    2    nsubj    _    _

Dependency Output

You can also specify an output of dependencies, similar to the Stanford parser (parameter -depout):

nsubj(sit-2, They-1)
root(ROOT-0, sit-2)

To add POS tags to the dependency output, use the -addpos parameter:

nsubj(sit#VBP-2, They#PRP-1)
root(ROOT-0, sit#VBP-2)
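For downstream processing, the dependency output can be read back into structured tuples. The following Python sketch is based only on the two example formats shown above; the function names are hypothetical, and it assumes that `#` separates word and POS tag and that the token index follows the last `-`:

```python
import re

# relation(head-index, dependent-index), as in the examples above
DEP_LINE = re.compile(
    r"^(?P<rel>[^\s(]+)\((?P<head>.+)-(?P<head_id>\d+), "
    r"(?P<dep>.+)-(?P<dep_id>\d+)\)$"
)


def split_token(token):
    """Split 'word#POS' into (word, POS); POS is None if no '#' is present."""
    word, sep, pos = token.rpartition("#")
    return (word, pos) if sep else (token, None)


def parse_dependency(line):
    """Return (relation, (head, pos, index), (dependent, pos, index))."""
    match = DEP_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency line: {line!r}")
    head = split_token(match["head"]) + (int(match["head_id"]),)
    dep = split_token(match["dep"]) + (int(match["dep_id"]),)
    return match["rel"], head, dep


print(parse_dependency("nsubj(sit#VBP-2, They#PRP-1)"))
# ('nsubj', ('sit', 'VBP', 2), ('They', 'PRP', 1))
```

Tokens that themselves contain `#` or `, ` may need extra handling; this sketch does not cover them.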

Usage:

Extract the contents of the zip file. Then execute the runnable jar:

java -jar org.jobimtext.collapsing.jar

Parameters:

Only two parameters are required: the input and output folders. All other parameters are optional:

java -jar org.jobimtext.collapsing.jar -i INPUT_PATH -o OUTPUT_PATH
-i, --input *Required: path to input folder (or file). For single file processing, set the '-sf' flag!
-sf, --single-file flag to indicate single file processing, instead of folder processing [default: folder processing]
-o, --output <arg> *Required: path to output folder
-depout, --dependency-output flag to enable universal dependencies output format, Stanford-like [default: false]
-addpos, --add-pos-tags only for dependency output! option to include POS tags with the tokens [default: false]
-c, --collapsing <arg> apply dependency collapsing rules [default: true]
-p, --propagation <arg> enable dependency propagation, only performed when collapsing is enabled [default: true]
-f, --format <arg> input file format, 't'ext or 'c'onll [default: t]
-l, --language <arg> language of input files [default: en]
-np, --no-parsing flag to disable parsing, e.g. when using pre-parsed CoNLL data [default: parsing enabled]
-nt, --no-tagging flag to disable POS tagging, e.g. when using pre-tagged CoNLL data [default: tagging enabled]
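When the tool is called from a script, the command line can be assembled programmatically. The sketch below is a hypothetical Python wrapper (the function `build_command` and its keyword names are assumptions); it emits only the short flags listed above and enforces the rule that -addpos applies to dependency output only:

```python
import shlex


def build_command(input_path, output_path, *, jar="org.jobimtext.collapsing.jar",
                  single_file=False, language=None, input_format=None,
                  no_tagging=False, no_parsing=False,
                  dep_output=False, add_pos=False):
    """Build the argument list for the collapsing jar from documented flags."""
    if add_pos and not dep_output:
        raise ValueError("-addpos only applies to dependency output (-depout)")
    cmd = ["java", "-jar", jar, "-i", input_path, "-o", output_path]
    if single_file:
        cmd.append("-sf")
    if language:
        cmd += ["-l", language]
    if input_format:
        cmd += ["-f", input_format]
    if no_tagging:
        cmd.append("-nt")
    if no_parsing:
        cmd.append("-np")
    if dep_output:
        cmd.append("-depout")
    if add_pos:
        cmd.append("-addpos")
    return cmd


cmd = build_command("corpus/de", "output", language="de")
print(shlex.join(cmd))
# java -jar org.jobimtext.collapsing.jar -i corpus/de -o output -l de
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the extracted jar.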

Execution Examples

Language selection

Dependency parsing and collapsing from raw text (English, default):

java -jar org.jobimtext.collapsing.jar -i corpus/en -o output

Dependency parsing and collapsing from raw text (German):

java -jar org.jobimtext.collapsing.jar -i corpus/de -o output -l de

Input format

Text input (default):

java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -f t

CoNLL input:

java -jar org.jobimtext.collapsing.jar -i corpus/en_connl -o output -f c

For testing, process a single file instead of a folder:

java -jar org.jobimtext.collapsing.jar -i corpus/en/english.txt -o output -sf

Collapsing Options

Dependency parsing without collapsing

java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -c 0

Dependency parsing with collapsing, without propagation

java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -p 0

Use a custom rule file (make sure that the correct language is specified!):

java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -r /PATH_TO/rulefile.txt

Use POS-tagged/parsed input

[Only possible on CoNLL input, raw text does not have POS tags or parses.]
Disable POS tagging (parsing still enabled by default):

java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output -f c -nt

Disable parsing (POS tagging enabled by default):

java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output -f c -np

Only apply dependency collapsing:

java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output -f c -np -nt
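One way to choose between these variants is to look at the column count of the CoNLL input. The mapping in this Python sketch is an interpretation of the examples above, not behavior built into the tool, and the function name is hypothetical:

```python
def flags_for_conll_line(line):
    """Suggest flags for CoNLL input based on one token line's column count."""
    columns = len(line.split())
    if columns == 2:
        # Tokenized only: tagging and parsing still have to run.
        return ["-f", "c"]
    if columns == 6:
        # POS tags present: skip tagging, keep parsing.
        return ["-f", "c", "-nt"]
    if columns == 10:
        # Tags and parse present: only apply collapsing.
        return ["-f", "c", "-np", "-nt"]
    raise ValueError(f"unexpected number of columns: {columns}")


print(flags_for_conll_line("1    They    _    PRP    PRP    _"))
# ['-f', 'c', '-nt']
```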

Dependency output

java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output_dep -depout

Download

We provide ASL- and GPL-licensed packages. The ASL package uses OpenNLP tools for tokenizing and tagging, and MaltParser for dependency parsing.

The GPL package uses OpenNLP for tokenizing and the Mate-tools POS tagger/parser for the remaining tasks. Please note that we are still working on converting the English Mate-tools parser output to universal dependencies. For now, we therefore use MaltParser for English parsing.

Collapsing ASL

collapsing-asl.zip

Collapsing GPL

collapsing-gpl.zip

Rule Download

The rule files can be downloaded from the rule repository in the JoBimText SVN repository. Currently, you can download the collapsing rules (with and without propagation) for English and German.

We welcome further contributions/adaptations to different parsers!
