Monthly Archives: May 2014

Test Corpus released, 1M Sentences

Today we released a test corpus of 1 million sentences, extracted at random from an English Wikipedia dump.

The test set can be downloaded from SourceForge: wikipedia_sample_1M.gz
Each sentence is on its own line, so the file can be processed without a sentence splitter.
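
Because the file has one sentence per line, it can be streamed straight from the compressed archive. The following is a minimal sketch, not part of the release: the helper name `iter_sentences` is hypothetical, and UTF-8 encoding is an assumption about the file.

```python
import gzip
from typing import Iterator


def iter_sentences(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one sentence per line from a gzip-compressed text file.

    The UTF-8 default is an assumption; adjust it if the file differs.
    """
    # "rt" opens the archive in text mode, so each line is a decoded string.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence:  # skip empty lines, should any occur
                yield sentence
```

Calling `iter_sentences("wikipedia_sample_1M.gz")` on the downloaded file would then yield the sentences one at a time without loading the whole corpus into memory.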

We have also published a description of how to obtain a smaller test set from a large corpus.
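
That description is not reproduced here. As an illustration only, one common way to draw a fixed-size random sample from a corpus too large to hold in memory is reservoir sampling. The sketch below assumes this technique, which may differ from the published procedure; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random in a single pass.

    This is reservoir sampling (Algorithm R), used here as an assumed
    stand-in for the published sampling procedure.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, index]; replace only if it falls in the reservoir.
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Passing any iterable of lines together with the desired sample size returns the sample as a list; if the input has fewer than `k` items, all of them are returned.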

We plan to use this corpus to exemplify the Distributional Thesaurus computation.