Are there APIs for text analysis/mining in Java?

Renato Dinhani picture Renato Dinhani · Jul 23, 2011 · Viewed 15.6k times · Source

I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can inform if a word found is a number, date, year, name, currency, etc.

I'm starting the text analysis now, so I only need an API to kickoff. I made a web-crawler, now I need something to analyze the downloaded data. Need methods to count the number of words in a page, similar words, data type and another resources related to the text.

Are there APIs for text analysis in Java?

EDIT: Text-mining, I want to mining the text. An API for Java that provides this.

Answer

William Niu picture William Niu · Jul 26, 2011

It looks like you're looking for a Named Entity Recogniser.

You have got a couple of choices.

CRFClassifier from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering), an open source suite for language processing. Take a look at the screenshots at the page for developers: http://gate.ac.uk/family/developer.html. It should give you a brief idea what this can do. The video tutorial gives you a better overview of what this software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:


In terms of training for CRFClassifier, you could find a brief explanation at their FAQ:

...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...

You can also find a code snippet at the javadoc of CRFClassifier:

Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output