List of Natural Language Processing Tools for Sentiment Analysis - Which one do you recommend?

Chriswede · Sep 6, 2012

First up, sorry for my not-so-perfect English... I am from Germany ;)

So, for a research project of mine (Bachelor thesis) I need to analyze the sentiment of tweets about certain companies and brands. For this purpose I will need to script my own program / use some sort of modified open-source code (no APIs - I need to understand what is happening).

Below you will find a list of some of the NLP applications I found. My question now is: which one and which approach would you recommend? And which one does not require long nights of adjusting the code?

For example: when I screen Twitter for the music player >iPod< and someone writes: "It's a terrible day but at least my iPod makes me happy" or, even harder: "It's a terrible day but at least my iPod makes up for it"

Which software is smart enough to understand that the focus is on the iPod and not on the weather?
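Just to illustrate what I mean, here is a deliberately naive sketch in plain Python (no external libraries): only score sentiment words that appear close to the target term. The word lists and window size below are made-up placeholders, not taken from any of the tools listed further down; a real solution would need POS tagging or parsing to attach sentiment words to the right target.

```python
# Naive illustration only: score sentiment words within a small window
# around the target term. The word lists and window size are made up;
# a real system would use POS tagging / parsing to attach sentiment
# words to the right target.
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"terrible", "awful", "hate", "broken"}

def window_sentiment(tweet, target="ipod", window=4):
    tokens = tweet.lower().replace(",", " ").split()
    if target not in tokens:
        return None  # the tweet does not mention the target at all
    i = tokens.index(target)
    nearby = tokens[max(0, i - window):i + window + 1]
    return (sum(1 for w in nearby if w in POSITIVE)
            - sum(1 for w in nearby if w in NEGATIVE))

# scores +1: "happy" is near "ipod", while "terrible" falls outside the window
print(window_sentiment("It's a terrible day but at least my iPod makes me happy"))
```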

Also, which software is scalable / resource-efficient (I want to analyze several tweets and don't want to spend thousands of dollars)?

Machine learning and data mining

Weka - a collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms including Naive Bayes and Support Vector Machines (SVM, listed under SMO) [Note: other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch]. A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.
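Since Weka itself is Java and I mostly want to understand the approach, here is how I picture the bag-of-words + Naive Bayes idea it implements, sketched with NLTK (listed further down) purely because it is shorter in Python. The two training tweets are made-up placeholders; a real classifier would need thousands of labelled tweets.

```python
# Sketch of the bag-of-words + Naive Bayes approach (the same idea behind
# Weka's NaiveBayes classifier), shown with NLTK for brevity.
# Requires the 'punkt' tokenizer models: nltk.download('punkt')
import nltk

def bag_of_words(text):
    # represent a tweet as the set of lowercased tokens it contains
    return {word.lower(): True for word in nltk.word_tokenize(text)}

# made-up training examples; a real system needs a large labelled corpus
train = [
    (bag_of_words("I love my iPod, great sound"), "pos"),
    (bag_of_words("My iPod is broken and the support is terrible"), "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(bag_of_words("I love the great sound")))  # 'pos' on this toy data
```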

Apache Lucene Mahout - An incubator project to create highly scalable distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.

NLP Tools

LingPipe - (not technically 'open source', see below) Alias-i's LingPipe is a suite of Java tools for linguistic processing of text including entity extraction, part-of-speech (POS) tagging, clustering, classification, etc. It is one of the most mature and widely used NLP toolkits in industry. It is known for its speed, stability, and scalability. One of its best features is the extensive collection of well-written tutorials to help you get started. They have a list of links to competing academic and industrial tools. Be sure to check out their blog. LingPipe is released under a royalty-free commercial license that includes the source code, but it is not technically 'open source'.

OpenNLP - hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and coreference analysis using the Maxent machine learning package.

Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It is released under the full GNU GPL.

OpenFST - A package for manipulating weighted finite-state automata. These are often used to represent a probabilistic model. They are used to model text for speech recognition, OCR error correction, machine translation, and a variety of other tasks. The library was developed by contributors from Google Research and NYU. It is a C++ library that is meant to be fast and scalable.

NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, from the University of Melbourne.
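Since NLTK is the Python option on this list, here is a minimal example of what it gives you out of the box: tokenization plus part-of-speech tags, which is the first step towards working out what a sentiment word actually refers to. The tokenizer and tagger models have to be downloaded once via nltk.download().

```python
# Minimal NLTK usage: tokenize a tweet and tag parts of speech.
# The tokenizer and tagger models must be fetched once via nltk.download().
import nltk

tweet = "It's a terrible day but at least my iPod makes me happy"
tokens = nltk.word_tokenize(tweet)
print(nltk.pos_tag(tokens))
# output looks roughly like:
# [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('terrible', 'JJ'), ('day', 'NN'), ...]
```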

Opinion Finder - A system that performs subjectivity analysis, automatically identifying when opinions, sentiments, speculations and other private states are present in text. Specifically, OpinionFinder aims to identify subjective sentences and to mark various aspects of the subjectivity in these sentences, including the source (holder) of the subjectivity and words that are included in phrases expressing positive or negative sentiments.

Tawlk/osae - A Python library for sentiment classification on social text. The end goal is to have a simple library that "just works". It should have a low barrier to entry and be thoroughly documented. We have achieved the best accuracy using stopword filtering with tweets collected on negwords.txt and poswords.txt.

GATE - GATE is over 15 years old and is in active use for all types of computational tasks involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from €multi-million research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents.

textir - A suite of tools for text and sentiment mining. This includes the ‘mnlm’ function, for sparse multinomial logistic regression, ‘pls’, a concise partial least squares routine, and the ‘topics’ function, for efficient estimation and dimension selection in latent topic models.

NLP Toolsuite - The JULIE Lab offers a comprehensive NLP tool suite for the application purposes of semantic search, information extraction and text mining. Most of our continuously expanding tool suite is based on machine learning methods and is thus domain- and language-independent.

...

On a side note: would you recommend the Twitter Streaming API or the GET (REST) API?

As for me, I am a fan of Python and Java ;)

Thanks a lot for your help!!!

Answer

Paul W · Sep 7, 2012

I'm not sure how much I can help, but I have worked with hand-rolled NLP before. A couple of issues come to mind. Not all products are language-agnostic (human language, that is, not computer language). If you're planning on analysing German tweets, it's going to be important that your selected product is able to handle German. Obvious, I know, but easy to forget. Then there's the fact that it's Twitter, where contractions and acronyms abound, and the language structure is constrained by the character limit, which means the grammar won't always match the expected structure of the language.

In English, pulling nouns from a sentence can be simplified somewhat if you ever have to write code of your own. Proper nouns have initial capitals, and a string of such words (possibly including "of") is an example of a noun phrase. A word preceded by "a/an/my/his/her/the/this/these/those" is going to be either an adjective or a noun. It gets harder after that, unfortunately.
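To make those rules concrete, here is a rough sketch in plain Python. It is my own heuristic rather than any of the tools you listed, and a proper POS tagger will beat it easily, but it shows the idea:

```python
# Rough heuristic from the rules above: a capitalized word after the first
# position is likely a proper noun, and a word following a determiner is
# likely a noun or an adjective. English only, and deliberately simplistic.
DETERMINERS = {"a", "an", "my", "his", "her", "the", "this", "these", "those"}

def candidate_nouns(sentence):
    words = sentence.split()
    candidates = set()
    for i, word in enumerate(words):
        clean = word.strip(".,!?")
        if i > 0 and clean[:1].isupper():                    # mid-sentence capital
            candidates.add(clean)
        if i > 0 and words[i - 1].lower() in DETERMINERS:    # word after a determiner
            candidates.add(clean)
    return candidates

print(candidate_nouns("It's a terrible day but at least my iPod makes me happy"))
# prints the set {'iPod', 'terrible'} - catching "terrible" (an adjective)
# is exactly why you still want a real tagger after this first pass
```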

There are rules which help identify plurals, but there are also lots of exceptions. I'm talking about English here, of course; my very poor spoken German doesn't help me understand that grammar, I'm afraid.