Python's NLTK vs. related Java Libraries?

wnewport · Apr 8, 2011 · Viewed 12.2k times

I've used LingPipe, Stanford's NER, RiTa and various sentence similarity libraries for my previous Java projects that focused on text (pre)processing (indexing, xml tagging, topic detection, etc.) of large amounts of English text (around 10,000 documents totalling more than 1 GB of text). Maybe I'm a bad Java programmer, but I find myself typing a lot of code and pulling in a lot of libraries every time I switch to a different corpus. Overall, I feel like there might be a better tool for the job.

I guess my question is, will I benefit from switching to Python and NLTK for information retrieval / language processing? Or are there enough pros and cons to make it very subjective? Is NLTK intuitive enough to be learned quickly?

I'd get my hands dirty, but I won't have access to a personal machine for the next few days.

Answer

lamwaiman1988 · Apr 8, 2011

NLTK is good for natural language processing; I used it for a data-mining project. You can train your own analyzers (taggers, classifiers, and so on), and the learning curve is not steep.

NLTK comes with a huge collection of corpora for training your analyzer. You can also supply your own data set, for example a journal that has been part-of-speech tagged.
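
As a minimal sketch (assuming a reasonably recent NLTK install), training your own tagger on one of the bundled corpora can be this short; the Brown corpus and the unigram/backoff setup are just illustrative choices:

```python
import nltk
from nltk.corpus import brown

nltk.download('brown')  # fetch the corpus once if it is not already installed

# Brown comes pre-tagged, so it can serve as training data out of the box.
train_sents = brown.tagged_sents(categories='news')

# A unigram tagger that falls back to a default tag ('NN') for unseen words.
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

print(tagger.tag(['The', 'report', 'covered', 'ten', 'thousand', 'documents']))
```

Swapping in your own part-of-speech tagged data is just a matter of passing your own list of tagged sentences instead of the Brown slices.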

Because Python is very good at text processing, you may want to give it a try. Plus, NLTK comes with an online tutorial.
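
To give a sense of how compact everyday preprocessing is, here is a small sketch of sentence splitting, tokenization, and frequency counts with NLTK (the sample text is made up, and the tokenizer resource name may differ across NLTK versions):

```python
import nltk

nltk.download('punkt')  # tokenizer models, needed once

text = "NLTK keeps text preprocessing short. It also ships with many corpora."
sentences = nltk.sent_tokenize(text)
tokens = [tok for sent in sentences for tok in nltk.word_tokenize(sent)]

print(sentences)
print(nltk.FreqDist(tokens).most_common(5))  # most frequent tokens
```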

Please don't forget to use a Python 2.x version; try Python 2.6. NLTK may not work well with Python 3.x.