Creating small test sets

If you want to try out JoBimText, for example to check whether it runs on your configuration, you can create smaller test datasets from large corpora.

If your corpus is one big file, you can extract a number of random lines (e.g. 1 million) using:

shuf -n 1000000 FILE > SAMPLE_FILE

If your corpus consists of several files, you can navigate into the corpus folder and run the following command. It draws up to 1000 random lines from each file, so you will need to adjust the -n parameter to get approximately the desired total number of lines.

find . -type f ! -name SAMPLE_FILE -exec shuf -n 1000 {} \; >> SAMPLE_FILE
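Instead of adjusting -n by hand, the per-file value can be derived from the desired total and the number of files. The following is a minimal sketch, assuming GNU coreutils (shuf, find, wc); the function name sample_corpus and its arguments are hypothetical helpers for illustration and not part of JoBimText.

```shell
# sample_corpus DIR TARGET OUT
# Draw roughly TARGET random lines in total from all files under DIR into OUT.
# Hypothetical helper. OUT should live outside DIR so it is not sampled itself.
sample_corpus() {
    corpus_dir=$1
    target=$2
    out=$3

    # Number of regular files in the corpus folder.
    num_files=$(find "$corpus_dir" -type f | wc -l)
    if [ "$num_files" -eq 0 ]; then
        echo "no files found in $corpus_dir" >&2
        return 1
    fi

    # Lines to draw per file, rounded up.
    per_file=$(( (target + num_files - 1) / num_files ))

    # shuf returns fewer lines for files shorter than per_file,
    # so the total can end up below the target.
    find "$corpus_dir" -type f -exec shuf -n "$per_file" {} \; > "$out"
}
```

For example, sample_corpus ./corpus 1000000 /tmp/SAMPLE_FILE would sample from every file under ./corpus. Because each file contributes the same number of lines, small files are over-represented compared to sampling from one concatenated file.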

Some corpora contain empty lines. You can use sed to remove them from your sample file.

sed -i '/^$/d' SAMPLE_FILE
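The pattern above only matches lines that are completely empty. If the corpus also contains lines made up of spaces, tabs, or a stray carriage return, a slightly broader pattern may help. This is a sketch under that assumption; drop_blank_lines is a hypothetical helper name, and it writes to a new file instead of editing in place.

```shell
# drop_blank_lines SRC DST
# Copy SRC to DST, leaving out empty and whitespace-only lines.
# Hypothetical helper; [[:space:]] is the POSIX whitespace character class.
drop_blank_lines() {
    sed '/^[[:space:]]*$/d' "$1" > "$2"
}
```

For example, drop_blank_lines SAMPLE_FILE SAMPLE_FILE.clean keeps the original sample untouched.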

If you want to remove lines that are too long or too short, you can also use sed. This command removes non-empty lines with 50 characters or fewer (e.g. headings):

sed -i '/^.\{1,50\}$/d' SAMPLE_FILE

The same can be done with lines that are too long, in this case lines with 500 characters or more:

sed -i '/^.\{500,\}$/d' SAMPLE_FILE
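The empty-line and length filters can also be applied in a single pass. The sketch below uses awk and keeps only lines whose length lies within a given range; filter_by_length and its default bounds (51 and 499, chosen to mirror the sed commands above) are hypothetical. Note that, depending on the awk implementation and locale, length may count bytes rather than characters.

```shell
# filter_by_length SRC DST [MIN] [MAX]
# Copy SRC to DST, keeping only lines with MIN..MAX characters (inclusive).
# Hypothetical helper; the defaults mirror the sed filters (drop <= 50 and >= 500).
filter_by_length() {
    src=$1
    dst=$2
    min=${3:-51}
    max=${4:-499}

    # An awk pattern without an action prints every matching line.
    awk -v min="$min" -v max="$max" \
        'length($0) >= min && length($0) <= max' "$src" > "$dst"
}
```

For example, filter_by_length SAMPLE_FILE SAMPLE_FILE.filtered writes the filtered lines to a new file, so the unfiltered sample stays available for comparison.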
