Monthly Archives: May 2014

Test Corpus released, 1M Sentences

May 28, 2014 | Updates | euge

Today we have released a test corpus with 1 million sentences. We extracted sentences at random from an English Wikipedia dump.

The test set can be downloaded from Sourceforge: wikipedia_sample_1M.gz
Each sentence is on a separate line, so the corpus can be processed without a sentence splitter.
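Because of the one-sentence-per-line layout, the file can be streamed directly. The following is a minimal sketch, not part of the release: the function name `iter_sentences` is hypothetical, and it assumes the archive is a single gzip-compressed, UTF-8 encoded text file that has already been downloaded.

```python
import gzip
from typing import Iterator


def iter_sentences(path: str) -> Iterator[str]:
    """Yield one sentence per non-empty line of a gzip-compressed text file.

    Assumption: the file is UTF-8 encoded text with one sentence per line.
    """
    # "rt" opens the archive in text mode, so lines arrive as str, not bytes.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            sentence = line.rstrip("\r\n")
            if sentence:  # skip blank lines
                yield sentence
```

With the file in the current directory, `for sentence in iter_sentences("wikipedia_sample_1M.gz"): ...` would visit each sentence in turn without loading the whole corpus into memory.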

We have also published a description of how to obtain a smaller test set from a large corpus.
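The published description is not reproduced here. As an illustration only, one common way to draw a uniform random subset from a corpus too large to hold in memory is reservoir sampling; the sketch below uses that technique and is not necessarily the procedure we describe. The name `sample_sentences` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def sample_sentences(sentences: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Return k sentences chosen uniformly at random (reservoir sampling).

    If the input holds fewer than k sentences, all of them are returned.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir: List[str] = []
    for index, sentence in enumerate(sentences):
        if index < k:
            # Fill the reservoir with the first k sentences.
            reservoir.append(sentence)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = sentence
    return reservoir
```

The input can be any iterable of lines, for example an open file handle, so the large corpus is read only once and only `k` sentences are kept in memory at any time.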

We plan to use this corpus to exemplify the Distributional Thesaurus computation.

German Trigram model available in the web demo and the API

May 5, 2014 | Updates | euge

We have now made the German trigram model available in the web demo and the API. The model was computed on 70 million German sentences from a news corpus, and the dataset is available on Sourceforge.

Try it out in the JoBimText web demo!

With the introduction of German to the API, we now cover four languages: English, German, Hindi, and Bengali.