The JoBimText web demo graphical user interface has been extended to support multi-word JoBimText models. The user can choose the desired token length of the multiwords (Jo-length) and then select multiwords using horizontal bars that span the corresponding tokens.
There are multiple views available for different presentation purposes: Graph, List and Table. Furthermore, the corpus count of the selected (multi-)word is now displayed prominently.
To try it out, go to the web demo site and select the “medline” dataset using the green dropdown selector at the top. You can then see the visualization by entering a sentence and clicking on “Parse (medlineTrigram)”.
Medline Multi-Word Dataset
We have computed a Medline JoBimText model which is available directly within the demo.
Multi-word items with up to 3 tokens can be used via the graphical interface or the API. The specialty of this model is that similarities were computed among Jos of lengths 1 to 4 tokens. Thus, for example, breast carcinoma (2 tokens) is related not only to breast cancer (2 tokens), but also to carcinoma of the breast (4 tokens) and melanoma (1 token).
The versatility of this model provides many advantages for life science and medical applications.
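To illustrate what a Jo of a given token length means, here is a minimal Python sketch that enumerates every contiguous span of 1 to 4 tokens in a tokenized phrase. The function name `candidate_spans` and the whitespace tokenization are assumptions made for illustration only; they are not part of the JoBimText code base or its API.

```python
def candidate_spans(tokens, max_len=4):
    """Return every contiguous span of 1..max_len tokens as (start, end, text).

    Hypothetical helper for illustration; not a JoBimText function.
    """
    spans = []
    for length in range(1, max_len + 1):
        for start in range(len(tokens) - length + 1):
            end = start + length
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


# Whitespace splitting stands in for a real tokenizer here.
tokens = "carcinoma of the breast".split()
spans = candidate_spans(tokens)
for start, end, text in spans:
    print(f"{end - start} token(s) [{start}:{end}]: {text}")
```

Each printed span corresponds to one possible selection with the horizontal bars in the demo, from single tokens up to the full 4-token phrase.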
Today we have released a test corpus with 1 million sentences. We extracted sentences at random from an English Wikipedia dump.
The test set can be downloaded from Sourceforge: wikipedia_sample_1M.gz
Each sentence is on a separate line, which makes the corpus easy to process without a sentence splitter.
We have also published a description of how to obtain a smaller test set from a large corpus.
We plan to use this corpus to exemplify the Distributional Thesaurus computation.
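Because the corpus stores one sentence per line, a smaller test set can be drawn in a single pass over the gzipped file. The sketch below uses reservoir sampling, which is one possible approach and not necessarily the procedure in the published description. The function name `sample_sentences`, the UTF-8 encoding, and the tiny demo file are assumptions; to sample from the real corpus, pass the path of the downloaded file instead.

```python
import gzip
import os
import random
import tempfile


def sample_sentences(path, k, seed=0):
    """Reservoir-sample k lines from a gzipped one-sentence-per-line file.

    Hypothetical helper; UTF-8 encoding is an assumption about the file.
    """
    rng = random.Random(seed)
    reservoir = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            sentence = line.rstrip("\n")
            if i < k:
                # Fill the reservoir with the first k sentences.
                reservoir.append(sentence)
            else:
                # Replace an existing entry with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = sentence
    return reservoir


# Demo on a tiny stand-in file, so the sketch runs without the real download.
demo_sentences = [f"Sentence number {n}." for n in range(100)]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_corpus.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("\n".join(demo_sentences) + "\n")
    subset = sample_sentences(demo_path, k=10)

print(len(subset))
```

The file is streamed line by line, so memory use depends on the sample size rather than on the corpus size.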
We have now made the German trigram model available in the demo. This dataset is available on Sourceforge. The model was computed on 70 million German sentences from a news corpus.
Try it out in the JoBimText web demo!
With the introduction of German to the API, we now cover 4 languages: English, German, Hindi and Bengali.
We have released an English trigram model, which consists of a Distributional Thesaurus (Simsort.gz), word counts (wordcount.gz), significance scores between terms and features (LMI.gz) and sense clusters with IS-A labels (cluster_isa.gz).
The holing operation was performed with a TrigramHolingAnnotator, where the target Jo is in the middle of the trigram. The features look like the following:
Sentence: Mary likes candy
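As a rough illustration of trigram holing with the target Jo in the middle, the following Python sketch pairs each inner token with its left and right neighbours. The function name `trigram_holing`, the `@` placeholder, and the underscore-joined feature format are assumptions made for this example. The actual output format of the TrigramHolingAnnotator, including any handling of sentence boundaries, may differ.

```python
def trigram_holing(tokens, hole="@"):
    """Pair each inner token (the Jo) with a context feature (the Bim).

    Hypothetical sketch: the feature format left_@_right is an assumption,
    and the first and last tokens are skipped because no padding is applied.
    """
    pairs = []
    for i in range(1, len(tokens) - 1):
        jo = tokens[i]
        bim = f"{tokens[i - 1]}_{hole}_{tokens[i + 1]}"
        pairs.append((jo, bim))
    return pairs


pairs = trigram_holing("Mary likes candy".split())
print(pairs)  # [('likes', 'Mary_@_candy')]
```

For the three-token example sentence, only the middle word has both a left and a right neighbour, so the sketch produces a single Jo-Bim pair.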
This dataset is included in the JoBimText web demo, where it can be tried out.
We have just published a new JoBimText pipeline, v. 0.0.7.
It contains a major overhaul: packages were refactored and combined into a clearer structure. The pipeline can be downloaded at https://sourceforge.net/projects/jobimtext/files/