Multilingual NLTK for POS Tagging and Lemmatizer

Question 1

Multilingual NLTK for POS Tagging and Lemmatizer

python nlp nltk pos-tagger lemmatization

Alessio Schiavelli · Sep 23, 2015 · Viewed 10.1k times · Source

Answer

Answer

If you are looking for another multilingual POS tagger, you might want to try RDRPOSTagger: a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. See experimental results including performance speed and tagging accuracy on 13 languages in this paper. RDRPOSTagger now supports pre-trained POS and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. RDRPOSTagger also supports the pre-trained Universal POS tagging models for 40 languages.

In Python, you can utilize the pre-trained models for tagging a raw unlabeled text corpus as:

python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example: python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest

If you would like to program with RDRPOSTagger, please follow code lines 92-98 in RDRPOSTagger.py module in pSCRDRTagger package. Here is an example:

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/German.RDR") #Load POS tagging model for German
DICT = readDictionary("../Models/POS/German.DICT") #Load a German lexicon 
r.tagRawSentence(DICT, "Die Reaktion des deutschen Außenministers zeige , daß dieser die außerordentlich wichtige Rolle Irans in der islamischen Welt erkenne .")

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR") # Load POS tagging model for French
DICT = readDictionary("../Models/POS/French.DICT") # Load a French lexicon
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe . ")

Question 2

Recently I approached to the NLP and I tried to use NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews made by travelers and so I have to manage a lot of texts written in different languages. I need to do two main operations: POS Tagging and lemmatization. I have seen that in NLTK there is a possibility to choice the the right language for sentences tokenization like this:

tokenizer = nltk.data.load('tokenizers/punkt/PY3/italian.pickle')

I haven't found the the right way to set the language for POS Tagging and Lemmatizer in different languages yet. How can I set the correct corpora/dictionary for non-english texts such as Italian, French, Spanish or German? I also see that there is a possibility to import the "TreeBank" or "WordNet" modules, but I don't understand how I can use them. Otherwise, where can I find the respective corporas?

Can you give me some suggestion or reference? Please take care that I'm not an expert of NLTK.

Many Thanks.

Multilingual NLTK for POS Tagging and Lemmatizer

Answer

Related questions