All posts by euge

JoBimText Web Demo: Multi-Word support and Medline Multi-Word model released

Multi-Word Support

The graphical user interface of the JoBimText web demo has been extended to support multi-word JoBimText models. Users can choose the desired token length of the multiwords (the Jo-length) and then select a multiword using the horizontal bars that span the corresponding tokens.

Screenshot of the GUI featuring Medline model

 

There are multiple views available for different presentation purposes: Graph, List and Table. Furthermore, the corpus count of the selected (multi-)word is now displayed prominently.

To try it out, go to the web demo site and select the “medline” dataset using the green dropdown selector at the top. You can then view the visualization by entering a sentence and clicking on “Parse (medlineTrigram)”.

Medline Multi-Word Dataset

We have computed a Medline JoBimText model which is available directly within the demo.

Multi-word items with up to 3 tokens can be used via the graphical interface or the API. The special feature of this model is that similarities were computed among Jos with lengths of 1 to 4 tokens. Thus, for example, breast carcinoma (2 tokens) is related not only to breast cancer (2 tokens), but also to carcinoma of the breast (4 tokens) and melanoma (1 token).

The versatility of this model provides many advantages for life science and medical applications.

Test Corpus released, 1M Sentences

Today we have released a test corpus with 1 million sentences. We extracted sentences at random from an English Wikipedia dump.

The test set can be downloaded from SourceForge: wikipedia_sample_1M.gz
Each sentence is on a separate line, so the file can be processed without a sentence splitter.
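
Because every line holds exactly one sentence, reading the corpus needs nothing beyond the standard library. The following is a minimal sketch, assuming the download is a gzip-compressed UTF-8 text file; the function name iter_sentences is made up for illustration and is not part of JoBimText.

```python
import gzip


def iter_sentences(path):
    """Yield one sentence per non-empty line of a gzip-compressed text file.

    Assumption: the file is UTF-8 encoded text, one sentence per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence:  # skip blank lines, if there are any
                yield sentence
```

Calling iter_sentences("wikipedia_sample_1M.gz") streams the sentences one at a time, so the whole corpus never has to fit in memory.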

We have also published a description of how to obtain a smaller test set from a large corpus.
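
The published description is not reproduced here. As a rough illustration only, one common way to draw a fixed-size random sample from a corpus too large to hold in memory is reservoir sampling. The sketch below is an assumption about how this could be done, not necessarily the procedure we used; the name sample_lines and its parameters are hypothetical.

```python
import random


def sample_lines(lines, k, seed=0):
    """Return k lines chosen uniformly at random from an iterable (Algorithm R).

    If the input has fewer than k lines, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a reservoir slot with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = line
    return reservoir
```

The input is read only once, and memory use depends on the sample size k rather than on the size of the corpus.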

We plan to use this corpus to exemplify the Distributional Thesaurus computation.

New JoBimText model released: English news trigram

We have released an English trigram model, which consists of a Distributional Thesaurus (Simsort.gz), word counts (wordcount.gz), significance scores between terms and features (LMI.gz) and sense clusters with IS-A labels (cluster_isa.gz).

The holing operation was performed with a TrigramHolingAnnotator, where the target Jo is in the middle of the trigram. The features look like the following:

Sentence:  Mary likes candy

Features:

Jo      Bim
Mary    3-gram2(_@_likes)
likes   3-gram2(Mary_@_candy)
candy   3-gram2(likes_@_)
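
The feature table above can be reproduced with a few lines of Python. This is a minimal sketch of the idea, not the TrigramHolingAnnotator itself: the function name trigram_holing is made up for illustration, and the sketch assumes whitespace tokenization and an empty string for a missing neighbor at the sentence boundary, as the example suggests.

```python
def trigram_holing(tokens):
    """Return (Jo, Bim) pairs where each Jo is the middle token of a trigram.

    The Jo is replaced by the hole symbol "@"; a missing left or right
    neighbor at the sentence boundary is left empty.
    """
    pairs = []
    for i, jo in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else ""
        right = tokens[i + 1] if i + 1 < len(tokens) else ""
        pairs.append((jo, f"3-gram2({left}_@_{right})"))
    return pairs


for jo, bim in trigram_holing("Mary likes candy".split()):
    print(jo, bim)
# Mary 3-gram2(_@_likes)
# likes 3-gram2(Mary_@_candy)
# candy 3-gram2(likes_@_)
```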

This dataset is included in the JoBimText web demo, where it can be tried out.