Natural Language Processing in Ruby

Joey Robert · Jun 16, 2009 · Viewed 22.1k times

I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby?

This is similar to "Is there a good natural language processing library", but for Ruby. I'd prefer something very general, but any leads are appreciated!

Answer

user2398029 · Apr 7, 2012

Three excellent and mature NLP packages are Stanford CoreNLP, OpenNLP and LingPipe. There are Ruby bindings to the Stanford CoreNLP tools (GPL license) as well as the OpenNLP tools (Apache License).

On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features can also serve as a reference for stable natural language processing gems compatible with Ruby 1.9.

  • Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel)
  • Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
  • Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.)
  • WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.)
  • Language detection (whatlanguage), date/time extraction (chronic, kronic, nickel) and keyword extraction (lda-ruby).
  • Text retrieval with indexing and full-text search (ferret).
  • Named entity extraction (stanford-core-nlp).
  • Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
  • Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
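
To give a feel for what the segmenters and tokenizers at the top of this list do, here is a minimal pure-Ruby sketch. It does not use the API of any gem listed above; the method names (naive_sentences, naive_tokens, token_counts) are made up for illustration, and the splitting rules are deliberately crude assumptions (for example, abbreviations like "Dr." will be split incorrectly, which is exactly the kind of thing the real segmenters handle).

```ruby
# Hypothetical helper: split text into sentences after '.', '!' or '?'
# followed by whitespace. Does not handle abbreviations or ellipses.
def naive_sentences(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Hypothetical helper: lowercase the sentence and pull out word-like tokens,
# keeping a leading '#' or '@' so hashtags and mentions survive.
def naive_tokens(sentence)
  sentence.downcase.scan(/[#@]?[a-z0-9']+/)
end

# Count how often each token occurs across all sentences of the text.
def token_counts(text)
  counts = Hash.new(0)
  naive_sentences(text).each do |sentence|
    naive_tokens(sentence).each { |token| counts[token] += 1 }
  end
  counts
end
```

For tweets, something like token_counts(tweet_text) may already be enough to spot frequent hashtags or mentions; for anything beyond that, one of the gems above is likely the better starting point.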

Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
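
If you only need a string distance and would rather not pull in a native extension, the classic Levenshtein edit distance is short enough to write by hand. The sketch below is a plain-Ruby illustration, not the API of hotwater or levenshtein-ffi; the method names levenshtein and similarity are my own, and the normalization by the longer string's length is just one common convention.

```ruby
# Edit distance between two strings: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
def levenshtein(a, b)
  a_chars = a.chars
  b_chars = b.chars
  # Row 0: distance from the empty prefix of a to the first j chars of b.
  previous = (0..b_chars.length).to_a

  a_chars.each_with_index do |ca, i|
    # current[0]: distance from the first i + 1 chars of a to the empty string.
    current = [i + 1]
    b_chars.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      current << [
        previous[j + 1] + 1, # delete ca
        current[j] + 1,      # insert cb
        previous[j] + cost   # substitute (or keep when equal)
      ].min
    end
    previous = current
  end

  previous.last
end

# Hypothetical convenience wrapper: 1.0 for identical strings, 0.0 when
# every character would have to change.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end
```

This runs in O(m * n) time and keeps only two rows in memory, which is fine for tweet-sized strings; for large batches the FFI-backed gems will probably be noticeably faster.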