Distributional Similarity with MapReduce

This page summarizes the MapReduce workflow of JoBimText, which computes the distributional similarity of words (Jo, Language Element) and of features (Bim, Context Feature). As of JoBimText v. 0.0.8, the Context Feature Extractor (the @@ operation) is integrated as a MapReduce job. The Distributional Thesaurus computation consists of a sequence of MapReduce steps implemented with Hadoop and Pig.
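To make the Jo/Bim terminology concrete, here is a minimal Python sketch of one possible context feature extraction. It is not the JoBimText extractor: the function name `holing_trigram`, the padding tokens `<s>` and `</s>`, and the `left_@@_right` feature format are assumptions chosen for illustration. Each word (Jo) is paired with its left and right neighbors (Bim), with `@@` marking the position the word was taken from.

```python
def holing_trigram(tokens, hole="@@"):
    """Return (jo, bim) pairs for a tokenized sentence.

    Hypothetical trigram-style extraction: the Bim is the left and right
    neighbor of the Jo, with `hole` marking the Jo's own position.
    """
    padded = ["<s>"] + list(tokens) + ["</s>"]  # assumed sentence boundary markers
    pairs = []
    for i in range(1, len(padded) - 1):
        jo = padded[i]
        bim = f"{padded[i - 1]}_{hole}_{padded[i + 1]}"
        pairs.append((jo, bim))
    return pairs


pairs = holing_trigram(["the", "cat", "sleeps"])
```

For the three-token sentence above, the word "cat" is paired with the feature `the_@@_sleeps`. Pairs of this kind are the input that the counting steps of the workflow operate on.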

Control flow overview

[Figure: JoBimText pipeline 0.0.8]

Explanation of parts

  • Language Element Count: hadoop
  • Context Feature Count: hadoop
  • Language Element — Context Feature Count: hadoop
  • Frequency Significance Measure: FreqSigLL, FreqSigLMI, FreqSigPMI, FreqSigFreq: pig script
  • Pruning: pig script
  • Aggregation Per Feature: hadoop
  • Similarity Counts: SimCounts1, SimCountNormalized, SimCountsLog: hadoop
  • Similarity Sort: pig script
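The order of the steps above can be illustrated with a small in-memory Python sketch. It is not the Hadoop or Pig implementation, and several details are assumptions: the significance score is a textbook LMI (joint frequency times pointwise mutual information), pruning keeps the `top_features` highest-scoring features per word with a positive score, and similarity is the plain number of shared features. The names `build_thesaurus`, `top_features`, and `max_jos_per_feature` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def lmi(n_jb, n_j, n_b, n):
    """Textbook LMI: joint count times log2 of observed over expected frequency."""
    return n_jb * math.log2(n_jb * n / (n_j * n_b))


def build_thesaurus(pairs, top_features=1000, max_jos_per_feature=1000):
    n = len(pairs)

    # Counting steps: Language Element, Context Feature, and pair counts.
    jo_count = Counter(jo for jo, _ in pairs)
    bim_count = Counter(bim for _, bim in pairs)
    jobim_count = Counter(pairs)

    # Frequency significance: score every (jo, bim) pair (LMI assumed here).
    scored = defaultdict(list)
    for (jo, bim), c in jobim_count.items():
        scored[jo].append((lmi(c, jo_count[jo], bim_count[bim], n), bim))

    # Pruning: keep only the best positively scored features per jo.
    kept = {}
    for jo, feats in scored.items():
        best = sorted(feats, reverse=True)[:top_features]
        kept[jo] = [bim for score, bim in best if score > 0]

    # Aggregation per feature: collect all jos that share a feature.
    by_feature = defaultdict(set)
    for jo, bims in kept.items():
        for bim in bims:
            by_feature[bim].add(jo)

    # Similarity counts: one point per shared feature (assumed measure).
    sim = defaultdict(Counter)
    for jos in by_feature.values():
        if len(jos) > max_jos_per_feature:
            continue  # skip very frequent features to bound the pair blow-up
        for a, b in permutations(sorted(jos), 2):
            sim[a][b] += 1

    # Similarity sort: most similar entries first, ties broken alphabetically.
    return {
        jo: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))
        for jo, c in sim.items()
    }


# Toy (jo, bim) observations; the feature strings are made up.
pairs = [
    ("cat", "eats_@@"), ("cat", "sleeps_@@"),
    ("dog", "eats_@@"), ("dog", "sleeps_@@"), ("dog", "barks_@@"),
    ("car", "drives_@@"), ("car", "honks_@@"),
]
thesaurus = build_thesaurus(pairs)
```

In this toy input, "cat" and "dog" share two features and end up as each other's only thesaurus entry, while "car" shares no feature with any other word and therefore has no entry. In a MapReduce setting each commented stage corresponds to a separate job, because every stage groups the data by a different key (word, feature, or word pair).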
