All posts by euge

JoBimText Web Demo: Multi-Word support and Medline Multi-Word model released

July 11, 2014 | Updates | euge

Multi-Word Support

The graphical user interface of the JoBimText web demo has been extended to support multi-word JoBimText models. The user chooses the desired token length of the multiwords (the Jo-length) and then selects a multiword by clicking one of the horizontal bars that span the corresponding tokens.

There are multiple views available for different presentation purposes: Graph, List and Table. Furthermore, the corpus count of the selected (multi-)word is now displayed prominently.

To try it out, go to the web demo site and select the “medline” dataset using the green dropdown selector at the top. You can then see the visualization by entering a sentence and clicking on “Parse (medlineTrigram)”.

Medline Multi-Word Dataset

We have computed a Medline JoBimText model which is available directly within the demo.

Multi-word items with up to 3 tokens can be used via the graphical interface or the API. A distinctive feature of this model is that similarities were computed among Jos of lengths 1 to 4 tokens. For example, breast carcinoma (2 tokens) is related not only to breast cancer (2 tokens), but also to carcinoma of the breast (4 tokens) and melanoma (1 token).
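To make the notion of Jo-length concrete, here is a minimal sketch that lists every contiguous token span of 1 to 4 tokens in a tokenized phrase. It only illustrates what a span of a given length looks like; the function name `candidate_spans` is hypothetical, and the way the Medline model actually selects multiwords is not described in this post.

```python
def candidate_spans(tokens, max_len=4):
    """Return (start, length, text) for every contiguous span of 1..max_len tokens.

    Hypothetical helper for illustration only; it is not part of JoBimText.
    """
    spans = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break  # span would run past the end of the phrase
            spans.append((start, length, " ".join(tokens[start:end])))
    return spans


tokens = "carcinoma of the breast".split()
for start, length, text in candidate_spans(tokens):
    print(f"start={start} length={length} text={text}")
```

In the web demo, each horizontal bar corresponds to one such span of the chosen Jo-length.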

The versatility of this model provides many advantages for life science and medical applications.

Test Corpus released, 1M Sentences

May 28, 2014 | Updates | euge

Today we have released a test corpus with 1 million sentences. We extracted sentences at random from an English Wikipedia dump.

The test set can be downloaded from Sourceforge: wikipedia_sample_1M.gz
Each sentence is on a separate line, so the corpus can be processed without a sentence splitter.
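Because the file is gzip-compressed with one sentence per line, it can be streamed without unpacking it first. The following is a minimal sketch; the UTF-8 encoding and the function name `iter_sentences` are assumptions, not something stated in the release.

```python
import gzip


def iter_sentences(path, encoding="utf-8"):
    """Yield one sentence per non-empty line of a gzip-compressed text file.

    The encoding is an assumption; adjust it if the file turns out to differ.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence:
                yield sentence
```

With the download in the working directory, `sum(1 for _ in iter_sentences("wikipedia_sample_1M.gz"))` would count the sentences while holding only one line in memory at a time.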

We have also published a description of how to obtain a smaller test set from a large corpus.
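One common way to draw a fixed-size random sample from a corpus too large to hold in memory is reservoir sampling. The sketch below illustrates that general technique; it is not necessarily the procedure given in the published description, and the name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return k lines chosen uniformly at random from an iterable (Algorithm R).

    If the iterable has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line  # replace with probability k / (i + 1)
    return sample
```

The input is read exactly once, so it works directly on a line iterator over a compressed corpus file.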

We plan to use this corpus to exemplify the Distributional Thesaurus computation.

German Trigram model available in the web demo and the API

May 5, 2014 | Updates | euge

We have now made the German trigram model available in the demo. This dataset is available on Sourceforge. The model was computed on 70 million German sentences from a news corpus.

Try it out in the JoBimText web demo!

With the introduction of German to the API, we now cover 4 languages: English, German, Hindi and Bengali.

New JoBimText model released: English news trigram

April 14, 2014 | Updates | euge

We have released an English trigram model, which consists of a Distributional Thesaurus (Simsort.gz), word counts (wordcount.gz), significance scores between terms and features (LMI.gz) and sense clusters with IS-A labels (cluster_isa.gz).

The holing operation was performed with a TrigramHolingAnnotator, where the target Jo is in the middle of the trigram. The features look like the following:

Sentence: Mary likes candy

Features:

Jo	Bim
Mary	3-gram2(_@_likes)
likes	3-gram2(Mary_@_candy)
candy	3-gram2(likes_@_)
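The feature table above can be reproduced with a few lines of Python. This is only a sketch of the holing idea under the format shown in the example; it is not the actual TrigramHolingAnnotator, and the function name `trigram_holing` is hypothetical.

```python
def trigram_holing(tokens):
    """Pair each token (Jo) with a trigram context feature (Bim).

    The Jo sits in the middle of the trigram and is replaced by "@";
    a missing neighbour at a sentence boundary is left empty, as in the
    example table.
    """
    pairs = []
    for i, jo in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else ""
        right = tokens[i + 1] if i + 1 < len(tokens) else ""
        pairs.append((jo, f"3-gram2({left}_@_{right})"))
    return pairs


for jo, bim in trigram_holing("Mary likes candy".split()):
    print(f"{jo}\t{bim}")
```

Running it on "Mary likes candy" prints the same three Jo/Bim pairs as listed in the table.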

This dataset is included in the JoBimText web demo, where it can be tried out.

New Release: JoBimText Pipeline 0.0.7

March 31, 2014 | Updates | euge

We have just published a new version of the JoBimText pipeline, v. 0.0.7.

This release is a major overhaul: packages have been refactored and combined into a clearer structure. The pipeline can be downloaded at https://sourceforge.net/projects/jobimtext/files/