This component performs dependency collapsing and propagation on dependency parses. It offers extensible rulesets for German and English that can be adapted to any parser. This work is described in:
Ruppert, E., Klesy, J., Riedl, M., and Biemann, C. (2015): Rule-based Dependency Parse Collapsing and Propagation for German and English. In Proc. GSCL 2015, Essen, Germany.
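To make the idea of collapsing concrete, here is a minimal Python sketch. It assumes the Stanford-style convention in which a preposition and its object are folded into a single `prep_<word>` relation. The function name `collapse_prepositions` and the triple representation are hypothetical; the component's actual rules live in its rule files and may differ.

```python
# Illustrative only: one hard-coded collapsing pattern
# (prep + pobj -> prep_<preposition>). This is NOT the component's rule engine.

def collapse_prepositions(deps, words):
    """deps: list of (relation, head_id, dependent_id); words: {id: token}."""
    # Map each preposition token id to the ids of its objects.
    pobj_of = {}
    for rel, head, dep in deps:
        if rel == "pobj":
            pobj_of.setdefault(head, []).append(dep)

    collapsed = []
    for rel, head, dep in deps:
        if rel == "prep" and dep in pobj_of:
            # Fold "sit -prep-> in -pobj-> car" into "sit -prep_in-> car".
            for obj in pobj_of[dep]:
                collapsed.append(("prep_" + words[dep].lower(), head, obj))
        elif rel == "pobj" and any(r == "prep" and d == head for r, _, d in deps):
            continue  # already absorbed into a prep_<word> relation
        else:
            collapsed.append((rel, head, dep))
    return collapsed


words = {1: "They", 2: "sit", 3: "in", 4: "the", 5: "car", 6: "."}
deps = [("nsubj", 2, 1), ("ROOT", 0, 2), ("prep", 2, 3),
        ("det", 5, 4), ("pobj", 3, 5), ("punct", 2, 6)]

for rel, head, dep in collapse_prepositions(deps, words):
    print(rel, head, dep)
```

In this sketch the two relations `prep(sit, in)` and `pobj(in, car)` become one relation `prep_in(sit, car)`, which links the content words directly.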
Input formats
As POS taggers and parsers are included in the packages, it is sufficient to provide plain text files (*.txt). This format is used by default.
They sit in the car.
The collapsing package also accepts the CoNLL format, in the 2-column (tokenized), 6-column (POS tagged) or 10-column (parsed) variant:
tokenized:
1 They
2 sit
3 in
4 the
5 car
6 .
POS tagged:
1 They _ PRP PRP _
2 sit _ VBP VBP _
3 in _ IN IN _
4 the _ DT DT _
5 car _ NN NN _
6 . _ . . _
parsed:
1 They _ PRP PRP _ 2 nsubj _ _
2 sit _ VBP VBP _ 0 ROOT _ _
3 in _ IN IN _ 2 prep _ _
4 the _ DT DT _ 5 det _ _
5 car _ NN NN _ 3 pobj _ _
6 . _ . . _ 2 punct _ _
Please note that POS tagging and parsing are enabled by default! If you do not want to redo the tagging or parsing, use the parameters -nt (no tagging) or -np (no parsing).
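Which of these flags applies follows from the number of columns in the CoNLL input. The following is a minimal sketch, assuming whitespace-separated columns and a consistent column count throughout the file. The helper name `suggest_flags` is hypothetical and not part of the package.

```python
def suggest_flags(conll_text):
    """Suggest command-line flags for CoNLL input based on its column count."""
    # Only the first non-empty line is inspected (assumption: uniform file).
    rows = [line.split() for line in conll_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no token lines found")
    columns = len(rows[0])
    if columns == 2:       # tokenized only: tagging and parsing are still needed
        return ["-f", "c"]
    if columns == 6:       # POS tagged: skip tagging, keep parsing
        return ["-f", "c", "-nt"]
    if columns == 10:      # parsed: skip tagging and parsing, only collapse
        return ["-f", "c", "-np", "-nt"]
    raise ValueError(f"unexpected column count: {columns}")


print(suggest_flags("1 They _ PRP PRP _\n2 sit _ VBP VBP _\n"))
```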
Output formats:
CoNLL
By default, the collapsing parser outputs CoNLL format:
T_ID Word [Lemma] CPOS POS __ HEAD DEPREL __ __
e.g.:
1 They _ PRP PRP _ 2 nsubj _ _
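For downstream processing, the ten columns can be read into named fields. The sketch below assumes whitespace-separated columns and a blank line between sentences (the usual CoNLL convention, not stated here). The names `Token`, `read_conll` and `unused1` to `unused3` are hypothetical.

```python
from collections import namedtuple

# Field names follow the column listing above; the three columns shown
# as "__" are given placeholder names (unused1..unused3).
Token = namedtuple(
    "Token", "tid word lemma cpos pos unused1 head deprel unused2 unused3")


def read_conll(text):
    """Yield one list of Token tuples per sentence."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split()
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tok = Token(*cols)
        # Token id and head are numeric; everything else stays a string.
        sentence.append(tok._replace(tid=int(tok.tid), head=int(tok.head)))
    if sentence:
        yield sentence


sample = "1 They _ PRP PRP _ 2 nsubj _ _\n2 sit _ VBP VBP _ 0 ROOT _ _\n"
sentences = list(read_conll(sample))
print([t.word for t in sentences[0] if t.head == 0])
```

A head value of 0 marks the root of the sentence, so the last line picks out the root token.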
Dependency Output
You can also request dependency output, similar to that of the Stanford parser (parameter -depout):
nsubj(sit-2, They-1)
root(ROOT-0, sit-2)
It is possible to add POS tags to the dependency output. Use the -addpos parameter to produce such output:
nsubj(sit#VBP-2, They#PRP-1)
root(ROOT-0, sit#VBP-2)
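The relationship between CoNLL rows and the two dependency notations can be sketched as follows. This is a reconstruction from the examples above, not the package's own code. The function name `to_dependency_output` and the five-element row tuples are hypothetical, and the lower-case `root` label is inferred from the sample output.

```python
def to_dependency_output(rows, add_pos=False):
    """rows: list of (id, word, pos, head, deprel) tuples for one sentence."""
    words = {tid: (word, pos) for tid, word, pos, _, _ in rows}

    def label(tid):
        if tid == 0:
            return "ROOT-0"  # artificial root node, shown without a POS tag
        word, pos = words[tid]
        return f"{word}#{pos}-{tid}" if add_pos else f"{word}-{tid}"

    lines = []
    for tid, _, _, head, deprel in rows:
        # The sample output writes the root relation in lower case.
        rel = "root" if head == 0 else deprel
        # Governor first, dependent second: relation(head, dependent).
        lines.append(f"{rel}({label(head)}, {label(tid)})")
    return lines


rows = [(1, "They", "PRP", 2, "nsubj"), (2, "sit", "VBP", 0, "ROOT")]
print("\n".join(to_dependency_output(rows)))
print("\n".join(to_dependency_output(rows, add_pos=True)))
```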
Usage:
Extract the contents of the zip file. Then execute the runnable jar:
java -jar org.jobimtext.collapsing.jar
Parameters:
Only two parameters are required, the input and output paths; all other parameters are optional:
java -jar org.jobimtext.collapsing.jar -i INPUT_PATH -o OUTPUT_PATH
-i, --input | *Required: path to input folder (or file). For single file processing, set the '-sf' flag! |
-sf, --single-file | flag to indicate single file processing instead of folder processing [default: folder processing] |
-o, --output <arg> | *Required: path to output folder |
-depout, --dependency-output | flag to enable universal dependencies output format, Stanford-like [default: false] |
-addpos, --add-pos-tags | only for dependency output! option to add POS tags to the tokens [default: false] |
-c, --collapsing <arg> | apply dependency collapsing rules [default: true] |
-p, --propagation <arg> | enable dependency propagation, only performed when collapsing is enabled [default: true] |
-f, --format <arg> | input file format, 't'ext or 'c'onll [default: t] |
-l, --language <arg> | language of input files [default: en] |
-np, --no-parsing | flag to disable parsing, e.g. when using pre-parsed CoNLL data [default: parsing enabled] |
-nt, --no-tagging | flag to disable POS tagging, e.g. when using pre-tagged CoNLL data [default: tagging enabled] |
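When the tool is called from a script, the options above can be assembled programmatically. The following is a minimal sketch using only flags listed in the table. The function `build_command` and its keyword names are hypothetical, and its checks encode only the constraints stated in this document (-addpos applies to dependency output, -nt/-np apply to CoNLL input).

```python
def build_command(input_path, output_path, *, single_file=False,
                  language="en", input_format="t", dep_output=False,
                  add_pos=False, no_tagging=False, no_parsing=False,
                  jar="org.jobimtext.collapsing.jar"):
    """Return the argument list for one run of the collapsing jar."""
    if input_format not in ("t", "c"):
        raise ValueError("input_format must be 't' (text) or 'c' (CoNLL)")
    if add_pos and not dep_output:
        raise ValueError("-addpos only affects dependency output (-depout)")
    if (no_tagging or no_parsing) and input_format != "c":
        raise ValueError("-nt/-np need CoNLL input; raw text has no tags or parses")

    cmd = ["java", "-jar", jar, "-i", input_path, "-o", output_path,
           "-f", input_format, "-l", language]
    # Boolean switches are appended only when requested.
    for enabled, flag in [(single_file, "-sf"), (dep_output, "-depout"),
                          (add_pos, "-addpos"), (no_tagging, "-nt"),
                          (no_parsing, "-np")]:
        if enabled:
            cmd.append(flag)
    return cmd


print(" ".join(build_command("corpus/de", "output", language="de")))
```

The returned list can be passed to `subprocess.run` as it stands; the sketch only prints the command and does not execute it.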
Execution Examples
Language selection
Dependency parsing and collapsing from raw text (English, default):
java -jar org.jobimtext.collapsing.jar -i corpus/en -o output
Dependency parsing and collapsing from raw text (German):
java -jar org.jobimtext.collapsing.jar -i corpus/de -o output -l de
Input format
Text input (default):
java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -f t
CoNLL input:
java -jar org.jobimtext.collapsing.jar -i corpus/en_connl -o output -f c
For testing, use single file processing instead of folder processing:
java -jar org.jobimtext.collapsing.jar -i corpus/en/english.txt -o output -sf
Collapsing Options
Dependency parsing without collapsing:
java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -c 0
Dependency parsing with collapsing, without propagation:
java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -p 0
Use a custom rule file (make sure that the correct language is specified!):
java -jar org.jobimtext.collapsing.jar -i corpus/en -o output -r /PATH_TO/rulefile.txt
Use POS-tagged/parsed input
[Only possible with CoNLL input, since raw text has no POS tags or parses.]
Disable POS tagging (parsing still enabled by default):
java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output -f c -nt
Disable parsing (POS tagging enabled by default):
java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output -f c -np
Only apply dependency collapsing:
java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output -f c -np -nt
Dependency output
java -jar org.jobimtext.collapsing.jar -i corpus_tagged -o output_dep -depout
Download
We provide ASL-licensed and GPL-licensed packages. The ASL package uses OpenNLP tools for tokenizing and tagging, and the Maltparser for dependency parsing.
The GPL package uses OpenNLP for tokenizing and the Mate-tools POS tagger/parser for the remaining tasks. Please note that we are still working on converting the English Mate-tools parser output to universal dependencies. Until then, we use the Maltparser for English parsing.
Collapsing ASL
Collapsing GPL
Rule Download
The rule files can be downloaded from the rule repository in the JoBimText SVN repository. Currently, you can download the collapsing rules (with and without propagation) for English and German.
We welcome further contributions/adaptations to different parsers!