Today we have released a test corpus of 1 million sentences, extracted at random from an English Wikipedia dump.
The test set can be downloaded from Sourceforge: wikipedia_sample_1M.gz
Each sentence is on a separate line, so the corpus can be processed without a sentence splitter.
We have also published a description of how to obtain a smaller test set from a large corpus.
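The published description is not reproduced here, but one common way to draw a uniform random sample from a large line-per-sentence corpus in a single pass is reservoir sampling. The sketch below is illustrative only; the function name and parameters are our own, not part of the released tooling:

```python
import random

def reservoir_sample(lines, k, seed=42):
    """Pick k lines uniformly at random from an iterable of lines,
    in one pass and with memory proportional to k, not the corpus size."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

# Example: sample 3 sentences from a toy 10-line corpus.
corpus = [f"sentence {i}" for i in range(10)]
sample = reservoir_sample(corpus, 3)
print(sample)
```

Because the corpus is one sentence per line, the same function works unchanged on a file handle (`reservoir_sample(open("wikipedia_sample_1M"), 10000)`), which is convenient when the full file does not fit in memory.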
We plan to use this corpus to exemplify the computation of a Distributional Thesaurus.