This site gives an overview of the JoBimText Web Demo and its API.
JoBimText is an open source Distributional Semantics framework that can produce lexical resources from large text corpora. For an overview, see Biemann & Riedl (2013).
If you want to install the demo in your infrastructure, consult the technical documentation.
For questions and inquiries, contact Martin Riedl, Manuel Kaufmann or Eugen Ruppert at the Language Technology group at the TU Darmstadt, Germany.
- Biemann, C., Riedl, M. (2013): Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity. Journal of Language Modelling 1(1):55–95 (BiemannRiedlText2D.pdf)
Web Demo
The web demo performs semantification of sentences.
It processes the sentence (dependency parsing or bigram/trigram feature extraction) and retrieves similar terms for each word in the sentence.
Holing methods
For sentence processing, the user can select different holing methods (feature extraction) and different languages.
Currently, English and German are supported. The following holing methods are available:
- stanford: Dependency parsing with the Stanford parser
- bigram: Bigram feature extraction (neighboring words, left or right neighbors)
- trigram: Trigram feature extraction (neighboring words, target word is always in the middle)
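To make the difference between the bigram and trigram holing methods concrete, here is a minimal sketch in Python. It is not the JoBimText implementation: the function names, the `@` hole marker, the `left(...)`/`right(...)` feature labels and the `^`/`$` boundary markers are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Placeholder marking the "hole" left by the target word (assumed convention).
HOLE = "@"


def bigram_holing(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each word with its left and its right neighbor as separate features."""
    pairs = []
    for i, word in enumerate(tokens):
        if i > 0:
            pairs.append((word, f"left({tokens[i - 1]}_{HOLE})"))
        if i < len(tokens) - 1:
            pairs.append((word, f"right({HOLE}_{tokens[i + 1]})"))
    return pairs


def trigram_holing(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each word with one feature built from both neighbors.

    The target word always sits in the middle of the trigram.
    """
    padded = ["^"] + tokens + ["$"]  # assumed sentence-boundary markers
    pairs = []
    for i in range(1, len(padded) - 1):
        pairs.append((padded[i], f"{padded[i - 1]}_{HOLE}_{padded[i + 1]}"))
    return pairs


tokens = ["The", "cat", "chases", "mice"]
for word, feature in bigram_holing(tokens) + trigram_holing(tokens):
    print(word, feature, sep="\t")
```

In this sketch the bigram method yields up to two features per word (one per neighbor), while the trigram method yields exactly one feature per word that combines both neighbors.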
Precomputed JoBimText models
We have precomputed JoBimText models for several languages.
Currently, only the English models contain ISA information. Sense clusters were computed for all models.
- English Stanford model
- English trigram model
- German trigram model
- Hindi bigram model
- Bengali bigram model
The models contain the following files. All files are tab-separated; the columns of each file are listed below:
- Word Count: word, count (*WordCount.gz)
- Feature Count (optional): feature, count (*FeatureCount.gz)
- Word-Feature Scores: word, feature, sig, count (*FreqSigLMI_s_0_t_0.gz)
- Similarity Graph: word1, word2, count (*FreqSigLMI…SimSortlimit_l_200.gz)
- Sense Clusters: word, cluster_id, cluster (*Senses_nXX.gz)
Additional models from the web demo will be made available on request.
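The model files listed above can be read with a few lines of Python. The following is a sketch under assumptions: it takes the column order from the list above, assumes UTF-8 encoding and no header line, and uses a small temporary demo file with invented counts instead of a real model file.

```python
import gzip
import os
import tempfile
from typing import Dict, Iterator, List

# Column layouts as given in the file list above.
WORD_COUNT_COLUMNS = ["word", "count"]
WORD_FEATURE_COLUMNS = ["word", "feature", "sig", "count"]
SIMILARITY_COLUMNS = ["word1", "word2", "count"]
SENSE_COLUMNS = ["word", "cluster_id", "cluster"]


def read_tsv_gz(path: str, columns: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per line of a gzip-compressed, tab-separated file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != len(columns):
                continue  # skip malformed lines instead of guessing their layout
            yield dict(zip(columns, fields))


# Demo with a tiny invented file; the counts are not taken from any real model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_WordCount.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("cat#NN\t120\nmouse#NN\t85\n")
    rows = list(read_tsv_gz(demo_path, WORD_COUNT_COLUMNS))

print(rows)
```

All values are returned as strings; convert `count` and `sig` to numbers as needed. Because the reader is a generator, large files can be streamed without loading them into memory.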
API
Methods
Our API offers the following GET and POST methods for access:
GET /holing/{holingtype}?s={sentence}
POST /holing/{holingtype}
GET /api/{holingtype}/jo/similar/{term}
GET /api/{holingtype}/jo/count/{term}
GET /api/{holingtype}/jo/senses/{term}
GET /api/{holingtype}/jo/isas/{term}
GET /api/{holingtype}/jo/sense-cuis/{term}
GET /api/{holingtype}/jo/similar-score/{term1}/{term2}
GET /api/{holingtype}/bim/count/{term}
GET /api/{holingtype}/jo/bim/count/{term}/{context}
GET /api/{holingtype}/jo/bim/score/{term}/{context}
GET /api/{holingtype}/jo/bim/score/{term}/
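Terms such as cat#NN contain a `#`, which HTTP clients treat as the start of a URL fragment, so it is safest to percent-encode terms before placing them in a path. The sketch below builds URLs for the `jo` term methods listed above. `BASE_URL` is a hypothetical placeholder, the helper name `jo_url` is an assumption, and the sketch assumes the server decodes percent-encoded path segments.

```python
from urllib.parse import quote

# Hypothetical placeholder; replace with the address of the actual demo server.
BASE_URL = "http://localhost:8080"


def jo_url(holingtype: str, method: str, *terms: str) -> str:
    """Build the URL for GET /api/{holingtype}/jo/{method}/{term}[/{term2}]."""
    encoded_terms = "/".join(quote(term, safe="") for term in terms)
    return f"{BASE_URL}/api/{quote(holingtype, safe='')}/jo/{method}/{encoded_terms}"


print(jo_url("stanford", "similar", "cat#NN"))
print(jo_url("stanford", "similar-score", "mouse#NN", "cat#NN"))
```

With `safe=""`, `quote` also encodes `/` inside a term, so a term can never be mistaken for an extra path segment.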
Response
The response is a structured JSON document containing the output or, if the operation was not successful, an error message.
Other formats can be selected by supplying a URL parameter: format=rdf for RDF, format=xml for XML and format=tsv for TSV (tab-separated values) output.
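The format parameter can be appended to any request URL. The helper below is a small sketch using only the Python standard library; the function name `with_format` and the example host are assumptions, and treating "json" as "no parameter" reflects only that JSON is the default output.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Non-default output formats named in the text above.
OTHER_FORMATS = {"rdf", "xml", "tsv"}


def with_format(url: str, fmt: str = "json") -> str:
    """Return url with the format parameter set; JSON is requested by omitting it."""
    fmt = fmt.lower()
    if fmt != "json" and fmt not in OTHER_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Drop any existing format parameter so it is never given twice.
    params = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True) if k != "format"]
    if fmt != "json":
        params.append(("format", fmt))
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))


# The host below is a hypothetical placeholder.
print(with_format("http://localhost:8080/holing/stanford?s=The+cat+chases+mice.", "xml"))
```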
Examples
We illustrate the API by processing the sentence "The cat chases mice."
Sentence Processing
- Sentence processing with default JSON output
- Sentence processing with RDF output
- Sentence processing with XML output
- Sentence processing with TSV output
XML, RDF and TSV output contains links to the distributional definitions of the terms (that are called Jos in our framework) and features (Bims). These can be used for term or context operations.
Term operations
- Similar terms to cat#NN (JSON)
- Count of mouse#NN in the corpus (XML)
- Sense cluster for mouse#NN (TSV)
- ISAs for cat#NN (RDF)
  Note that in JSON, XML and RDF output, the term methods jo/senses, jo/isas and jo/sense-cuis each contain the results of all three methods. Only in TSV output does each method produce its own output, due to the format.
- Sense IDs for mouse#NN (XML)
- Similarity score between mouse#NN and cat#NN (JSON)
Context operations
- Count of the feature mouse (direct object) (JSON)
- Count of chase#VB and mouse (direct object) (RDF)
- Related features for mouse#NN (TSV)
- Scores for the term–context combination cat#NN and chase (is subject of) (XML)
Format
The output contains information on the holing operation, API method, error messages and the actual result.
JSON and XML contain descriptive names, like <sense> or error.
TSV gives a description of the format in the last comment line at the top, e.g. # Sense TAB URI.
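Since the TSV output describes its columns in the last comment line at the top, a client can use that line as the header. The following sketch assumes that comment lines start with `#`, that the column names are separated either by the literal word TAB (as in the example above) or by a tab character, and that data lines are tab-separated. The sample response, including its first comment line and the example.org URI, is invented for illustration.

```python
from typing import Dict, List


def parse_tsv_response(text: str) -> List[Dict[str, str]]:
    """Parse TSV output whose last leading '#' comment line names the columns."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    in_preamble = True
    for line in text.splitlines():
        if not line.strip():
            continue
        if in_preamble and line.startswith("#"):
            # Each comment line overwrites the previous one, so the last one wins.
            spec = line.lstrip("#").strip()
            header = [name.strip() for name in spec.replace(" TAB ", "\t").split("\t")]
            continue
        in_preamble = False  # after the first data line, '#' is ordinary content
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


# Invented sample; a real response may carry different comment lines and values.
sample = "# method: senses\n# Sense TAB URI\nrodent\thttp://example.org/jo/rodent\n"
print(parse_tsv_response(sample))
```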
For RDF, which is based on XML, we created an RDF Schema. We welcome suggestions for the RDF Schema, as we try to make it easy to use and compatible with other open data APIs.
Terms of use
The web demo can be used without requirements and free of charge.
We can provide support but can offer no warranty.
To better understand usage, provide faster access and improve the application in the desired direction, we monitor and log user input. We log only the input and the method. No identifying information such as time, IP address or browser agent is collected.