Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collections of text

Michael Julson · Oct 21, 2008 · Viewed 10.5k times

I'm working on a project where I need to analyze a page of text, and collections of pages of text, to determine the dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, is there an algorithm, or several, that would achieve my goals below?

What I want to do is similar to the word clouds built from a URL or RSS feed that you find on the web, except I don't want the visualization. They are used all the time to analyze presidential candidates' speeches and see what the themes or most-used words are.

The complication is that I need to do this on thousands of short documents, and then on collections or categories of these documents.

My initial plan was to parse the document, then filter out common words (of, the, he, she, etc.), then count the number of times the remaining words show up in the text (and in the overall collection/category).
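A minimal Python sketch of that plan, just to make the steps concrete. The stop-word list, the regex tokenizer, and the function names are all assumptions made for illustration, not part of any particular library:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "he", "in", "is", "it", "of", "she", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(text):
    """Count every token in one document that is not a stop word."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

def collection_counts(documents):
    """Sum the per-document counts over a collection or category."""
    total = Counter()
    for doc in documents:
        total.update(word_counts(doc))
    return total

docs = ["The economy is the issue.", "She spoke of the economy and of taxes."]
print(collection_counts(docs).most_common(3))
```

Keeping the per-document counts and summing them per category means the same pass can serve both the single-page and the collection view.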

The problem is that in the future I would like to handle stemming, plural forms, etc. I would also like to see if there is a way to identify important phrases (instead of counting a single word, counting a phrase of 2-3 words together).

Any guidance on a strategy, libraries, or algorithms that would help is appreciated.

Answer

Robert Elwell · Oct 21, 2008

One option for what you're doing is term frequency-inverse document frequency, or tf-idf. Terms that are frequent in a document but rare across the collection get the highest weighting under this calculation. Check it out here: http://en.wikipedia.org/wiki/Tf-idf
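To make the weighting concrete, here is a rough Python sketch. It assumes one common variant (tf as the term's share of the document's tokens, idf as log(N / document frequency)); other variants exist, and the tokenizer and function names are illustrative only:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the economy and the budget",
    "the war and the troops",
    "the economy and the war",
]
for scores in tf_idf(docs):
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2])
```

With this variant, a word that appears in every document (like "the" here) gets a weight of zero, so very common words drop out even without a stop-word list.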

Another option is to use something like a naive Bayes classifier with words as features, and then find the strongest features in the text to determine the class of the document. A maximum entropy classifier would work similarly.
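One way to read "strongest features" is sketched below, assuming a multinomial naive Bayes model with add-one smoothing: rank each word by the log ratio of its probability in the target category to its probability everywhere else. The category labels, sample texts, and function names are made up for illustration:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def strongest_features(labeled_docs, target, top_n=3):
    """Rank words by log P(word | target) - log P(word | other categories)."""
    in_counts, out_counts = Counter(), Counter()
    for label, text in labeled_docs:
        (in_counts if label == target else out_counts).update(tokenize(text))
    vocab = set(in_counts) | set(out_counts)
    # Add-one (Laplace) smoothing: every word gets one extra count per side.
    in_total = sum(in_counts.values()) + len(vocab)
    out_total = sum(out_counts.values()) + len(vocab)

    def log_ratio(word):
        p_in = (in_counts[word] + 1) / in_total
        p_out = (out_counts[word] + 1) / out_total
        return math.log(p_in / p_out)

    ranked = sorted(vocab, key=log_ratio, reverse=True)
    return [(word, round(log_ratio(word), 3)) for word in ranked[:top_n]]

labeled = [
    ("economy", "taxes and the budget and more taxes"),
    ("economy", "the budget needs lower taxes"),
    ("defense", "the troops and the war"),
    ("defense", "the war needs more troops"),
]
print(strongest_features(labeled, "economy"))
```

Words with a large positive ratio are the ones that most distinguish the category, which is roughly the "dominant words per category" the question asks about.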

As far as tools to do this, the best tool to start with would be NLTK, a Python library with extensive documentation and tutorials: http://nltk.sourceforge.net/

For Java, try OpenNLP: http://opennlp.sourceforge.net/

For phrases, consider using bigrams and trigrams as features in the second option above, or even as terms in tf-idf.
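A small Python sketch of extracting bigrams and trigrams so they can be counted or fed in as terms. The helper names are illustrative, and the tokenizer is a simple assumption that ignores sentence boundaries:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def phrase_counts(documents, sizes=(2, 3)):
    """Count every 2-word and 3-word phrase across the documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for n in sizes:
            counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts

docs = ["health care reform now", "we need health care reform"]
print(phrase_counts(docs).most_common(3))
```

The resulting phrase strings can be treated exactly like single words in whichever weighting scheme you pick.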

Good luck!