If you want to try out JoBimText, for example to see whether it runs on your configuration, you can create smaller test datasets from large corpora.
If your corpus is one big file, you can extract a number of random lines (e.g. 1 million) using:
shuf -n 1000000 FILE > SAMPLE_FILE
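Before sampling a large corpus, it can help to check the behaviour on a throwaway file. The following is only a minimal sketch: the toy corpus and the variable names (work_dir, sample_lines, distinct_lines) are made up for illustration, and GNU coreutils is assumed.

```shell
# Work in a temporary folder so no real data is touched.
work_dir=$(mktemp -d)

# Build a toy corpus of 100 distinct lines (hypothetical content).
seq 1 100 | sed 's/^/sentence number /' > "$work_dir/toy_corpus.txt"

# Draw 10 random lines, using the same shuf call as above with a smaller -n.
shuf -n 10 "$work_dir/toy_corpus.txt" > "$work_dir/toy_sample.txt"

# Count the sampled lines and the distinct sampled lines.
sample_lines=$(wc -l < "$work_dir/toy_sample.txt")
distinct_lines=$(sort -u "$work_dir/toy_sample.txt" | wc -l)
echo "sampled: $sample_lines, distinct: $distinct_lines"
```

shuf samples without replacement by default, so both counts should agree. If the input has fewer lines than -n, shuf simply returns all of them in random order.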
If your corpus consists of several files, navigate into the corpus folder and use the following line. It draws up to 1000 lines from each file, so you will need to adjust the -n parameter to get approximately the desired total number of lines.
find . -type f ! -name SAMPLE_FILE -exec shuf -n 1000 {} \; >> SAMPLE_FILE
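One way to choose the -n value is to divide the desired total by the number of files. The sketch below does this on a toy corpus; the folder layout, TARGET, OUT, n_files and per_file are all assumptions made for illustration, not part of JoBimText.

```shell
# Toy corpus in a temporary folder (hypothetical layout: 4 files, 50 lines each).
corpus_dir=$(mktemp -d)
cd "$corpus_dir"
for f in a b c d; do seq 1 50 | sed "s/^/$f line /" > "$f.txt"; done

TARGET=20          # desired total number of sampled lines (made-up value)
OUT=SAMPLE_FILE    # output file, excluded from the search below

# Count the corpus files, then divide the target by that count, rounding up.
n_files=$(find . -type f ! -name "$OUT" | wc -l)
per_file=$(( (TARGET + n_files - 1) / n_files ))

# Sample per_file lines from each file into one output file.
find . -type f ! -name "$OUT" -exec shuf -n "$per_file" {} \; > "$OUT"
echo "files: $n_files, lines per file: $per_file, total: $(wc -l < "$OUT")"
```

Files with fewer than per_file lines contribute fewer lines, so on corpora with many short files the total can end up below the target.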
Some corpora contain empty lines. You can use sed to remove them from your sample file. The -i option edits the file in place; the syntax shown here is that of GNU sed.
sed -i '/^$/d' SAMPLE_FILE
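The pattern above only matches lines with no characters at all. If the corpus also has lines made up only of spaces or tabs, a variant using the POSIX character class [[:space:]] may be useful. This is a sketch on a made-up five-line file, again assuming GNU sed for -i; the variable names are hypothetical.

```shell
# Toy sample file: two sentences, one empty line, two whitespace-only lines.
toy_sample=$(mktemp)
printf 'first sentence\n\n   \n\t\nsecond sentence\n' > "$toy_sample"

# Delete lines that are empty or contain only whitespace.
sed -i '/^[[:space:]]*$/d' "$toy_sample"

# Count what is left.
remaining=$(wc -l < "$toy_sample")
echo "remaining lines: $remaining"
```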
If you want to remove lines that are too short or too long, you can also use sed.
This command removes lines with 1 to 50 characters (e.g. headings):
sed -i '/^.\{1,50\}$/d' SAMPLE_FILE
The same can be done with lines that are too long, in this case lines with 500 or more characters:
sed -i '/^.\{500,\}$/d' SAMPLE_FILE
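Both length limits can also be applied in a single pass. The sketch below uses awk instead of sed, which is a swapped-in technique and not the method described above; it keeps only lines with more than 50 and fewer than 500 characters, which is what the two sed patterns leave behind for non-empty lines. The toy file and the variable names are made up for illustration.

```shell
# Toy sample file with three lines of 10, 100 and 600 characters.
toy_sample=$(mktemp)
for n in 10 100 600; do
    printf '%*s\n' "$n" '' | tr ' ' 'x'
done > "$toy_sample"

# Keep lines longer than 50 and shorter than 500 characters.
# Writes to a new file instead of editing in place.
awk 'length($0) > 50 && length($0) < 500' "$toy_sample" > "$toy_sample.filtered"

kept=$(wc -l < "$toy_sample.filtered")
echo "kept lines: $kept"
```

Note that some awk implementations count bytes rather than characters, so on corpora with multi-byte characters the limits may behave slightly differently from sed.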