Compute Sense Clusters

This effect occurs mostly when the edge file is not sorted properly. The solution is to sort the edge file again and delete all intermediate files:

sort -k1,1 -k3,3nr file.edges > newfile.edges
rm file.edges.*
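
The sort keys can be checked on a small sample: -k1,1 sorts by the first column, and -k3,3nr breaks ties by the third column, numerically and in descending order. The file name and the three-column data below are illustrative, not part of the pipeline:

```shell
# Sample edge file: term, related term, score (tab-separated, made-up data)
printf 'cat\tdog\t0.5\ncat\tmouse\t0.9\nape\tman\t0.7\n' > demo.edges

# Sort by first column (lexicographic), then by score, numeric and descending
sort -k1,1 -k3,3nr demo.edges
# ape   man     0.7
# cat   mouse   0.9
# cat   dog     0.5
```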

Similarity Calculations using Hadoop

We recommend using only features that occur with fewer than 1000 words, and keeping only positive significance scores. We further recommend the LMI significance measure, keeping only the top 1000 features with the highest significance scores per term. Normally it is sufficient to keep the top 200 most similar terms for each term. This is achieved with:

python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200
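
One reading of the positional arguments, based on the recommendations above; the labels are ours, not the script's own parameter names:

```shell
python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200
#   dataset  name of the corpus/dataset on HDFS
#   1000     use only features that occur with fewer than 1000 words
#   0        lower bound on significance scores (keep positive scores only)
#   0        second threshold (purpose not stated in the text above)
#   1000     keep the top 1000 features per term
#   LMI      significance measure
#   200      keep the top 200 most similar terms per term
```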

When I try to execute one of the DT computation pipelines, I get an error saying “Unsupported major.minor version 51.0”. Why doesn’t it work?

The problem is that the Java version installed on your Hadoop machines is older than the one the project was compiled with.

You can compile the JoBimText project yourself and change the required Java version there. We recommend using at least Java 7; Java 6 might also work.

Here is a list of Java versions and their class file major versions:

J2SE 8 = 52
J2SE 7 = 51
J2SE 6.0 = 50
J2SE 5.0 = 49
JDK 1.4 = 48
JDK 1.3 = 47
JDK 1.2 = 46
JDK 1.1 = 45
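
To find out which class version a compiled file actually targets, you can read the two-byte major version stored at offset 6 of the .class file; no JDK is needed. The file written below is a synthetic 8-byte class file header, just for illustration:

```shell
# For illustration, write the 8-byte header of a class compiled for Java 7
# (magic CA FE BA BE, minor version 0, major version 51 = octal 063):
printf '\312\376\272\276\000\000\000\063' > Example.class

# Print the two major-version bytes as decimals; "0 51" encodes major
# version 51, i.e. Java 7
od -An -tu1 -j6 -N2 Example.class
```

With a JDK installed, `javap -verbose` on a class prints the same "major version" information.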

The similarity computation stops with an exception, e.g. ERROR 2997: Encountered IOException. File pig/FreqSig1000.pig does not exist.

The reason for this error is a wrong parameter when generating the script (generateHadoopScript.py) that runs the Hadoop pipeline: a number was given to the script instead of a significance measure, so the generated pipeline references a Pig script (here FreqSig1000.pig) that does not exist. The following parameters could be used for generating the script:

python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200

I encounter Out-of-memory errors during the first step of the similarity calculation (the Holing operation). What can I do?

This can happen if your input corpus contains very long lines. We recommend splitting your corpus into single sentences, so that the input contains one sentence per line. If you cannot even run the sentence filtering operation, you can use streaming to split your data.

  • First, create a new folder, e.g. one whose name ends in “split”.
  • Perform sentence splitting by splitting the sentences at full stops:
    hadoop fs -text CORPUS/* | sed 's/[.]/.\n/g' | hadoop fs -put - CORPUS_split/corpus.txt
  • Generate a new pipeline with the python script, this time passing CORPUS_split as the dataset argument.
  • Run the pipeline. If you still get similar errors, try the sentence filtering operation again; it should work now.
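
The splitting step can be tried locally first. Note that the \n in the replacement requires GNU sed, and that splitting at every full stop also breaks abbreviations like “e.g.”, so this is only a rough workaround:

```shell
# Rough sentence splitting at full stops (GNU sed: \n in the replacement)
printf 'First sentence. Second sentence.\n' | sed 's/[.]/.\n/g'
# First sentence.
#  Second sentence.
```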

I encounter Out-of-memory issues when I try to use a Distributional Thesaurus. What can be done?

When the DT was computed on a large corpus, it contains many entries that are not significant (singletons, entries with very low similarity scores, etc.). It is therefore advisable to prune a large DT to keep it manageable. The quality does not suffer from pruning.
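
A minimal pruning sketch with awk, assuming a tab-separated DT layout of term, similar term, and score; the file names, the sample data, and the cutoff of 2 are illustrative, so pick a threshold that fits your data:

```shell
# A tiny sample DT: term, similar term, score (tab-separated, made-up data)
printf 'cold\twarm\t120\ncold\tice\t1\ncold\tchilly\t95\n' > dt.tsv

# Keep only entries whose score is at least 2; this drops singletons
# and near-zero similarities
awk -F'\t' '$3 >= 2' dt.tsv > dt.pruned.tsv
```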

When I run the DT Pipeline, I encounter Java version errors, like ‘Unsupported major.minor version 51.0.’ What can I do?

We advise you to upgrade Java to at least Java 7, since class version 51.0 corresponds to Java 7. If it is not possible to upgrade, or to ask for an upgrade, you can compile JoBimText with a lower target setting for javac.
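
If you build JoBimText with Maven, the bytecode target can usually be lowered from the command line via the standard maven-compiler-plugin properties. This is only a sketch; your build may pin these values in the pom instead:

```shell
# Rebuild with Java 1.6 bytecode as the target (maven-compiler-plugin properties)
mvn clean package -Dmaven.compiler.source=1.6 -Dmaven.compiler.target=1.6
```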

As many of the libraries used were also compiled for a newer Java version, you will need to downgrade them as well. This especially affects the following DKPro components:

  • de.tudarmstadt.ukp.dkpro.core.api.lexmorph-asl-1.6.0.jar
  • de.tudarmstadt.ukp.dkpro.core.api.metadata-asl-1.6.0.jar
  • de.tudarmstadt.ukp.dkpro.core.api.parameter-asl-1.6.0.jar
  • de.tudarmstadt.ukp.dkpro.core.api.resources-asl-1.6.0.jar
  • de.tudarmstadt.ukp.dkpro.core.api.segmentation-asl-1.6.0.jar
  • de.tudarmstadt.ukp.dkpro.core.api.syntax-asl-1.6.0.jar
  • de.tudarmstadt.ukp.dkpro.core.maltparser-asl-1.6.0.jar

You can find replacements at GrepCode. To avoid problems, remove the newer files (*-1.6.0.jar). Note, however, that this solution might not work in the future, when JoBimText depends on newer Java versions.

Import and Usage

There are many DT entries with a count of 0 in the database. How is this possible? The input data looks correct.

This is likely an escaping problem. The DT file format is tab-separated; when an entry ends with a backslash ‘\’, the database assumes that the following tab character is escaped and should not be used as a separator. This can be resolved with a simple sed command that doubles every backslash:

sed -i 's/\\/\\\\/g' FILE
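
The effect of the command can be checked on a one-line sample. After the substitution the field ends in a doubled backslash, so it can no longer escape the tab that follows it:

```shell
# A field ending in a backslash, followed by a tab and the next field;
# sed doubles the backslash, so the tab acts as a separator again
printf 'word\\\tnext\n' | sed 's/\\/\\\\/g'
```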

