Python: NLTK and TextBlob in French

Question 1

python nltk textblob

Sulli · Feb 6, 2017 · Viewed 8.2k times · Source

Answer

By default, NLTK uses the English tokenizer, which can behave unpredictably on French text.

@fpierron is correct. As the article they mention explains, you simply have to load the tokenizer model for the right language and use it in your program.

import nltk.data
# load the French sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
tokens = tokenizer.tokenize("Jadis, une nuit, je fus un papillon, voltigeant, content de son sort. Puis, je m’éveillai, étant Tchouang-tseu. Qui suis-je en réalité ? Un papillon qui rêve qu’il est Tchouang-tseu ou Tchouang qui s’imagine qu’il fut papillon ?")

print(tokens)

['Jadis, une nuit, je fus un papillon, voltigeant, content de son sort.', 'Puis, je m’éveillai, étant Tchouang-tseu.', 'Qui suis-je en réalité ?', 'Un papillon qui rêve qu’il est Tchouang-tseu ou Tchouang qui s’imagine qu’il fut papillon ?']

If you don't have the file, you can use nltk.download() to download the correct model for French (it ships in the 'punkt' package).
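A small helper can make that fallback automatic. The sketch below is only an illustration: load_punkt and its load/download parameters are hypothetical names introduced here (the parameters exist so the function can be exercised with stubs). It relies on nltk.data.load raising LookupError when the pickle is missing and on the French model being part of the 'punkt' package. Recent NLTK releases distribute the Punkt data differently (as punkt_tab), so the resource path may need adjusting there.

```python
def load_punkt(language="french", load=None, download=None):
    """Return the Punkt sentence tokenizer for `language`.

    If the pickle is not found locally, download the 'punkt' package
    once and try again. `load` and `download` default to nltk.data.load
    and nltk.download; they are parameters only so stubs can be passed in.
    """
    if load is None or download is None:
        import nltk  # imported lazily, only when the real functions are needed
        load = load or nltk.data.load
        download = download or nltk.download

    resource = f"tokenizers/punkt/{language}.pickle"
    try:
        return load(resource)
    except LookupError:
        # Model not installed yet: fetch the package, then retry the load.
        download("punkt")
        return load(resource)


# Typical use (requires NLTK to be installed):
# tokenizer = load_punkt("french")
# sentences = tokenizer.tokenize("Qui suis-je en réalité ? Un papillon ?")
```

If the second load also fails, the LookupError propagates, which is usually what you want: it means the package does not contain a model for that language.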

If you look at the NLTK documentation for the tokenizer module, there are some other examples: http://www.nltk.org/api/nltk.tokenize.html
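To check which languages the downloaded Punkt package covers, one option is to list the pickles it contains. This is a minimal sketch assuming the package was unpacked with one <language>.pickle file per language, as with french.pickle above. installed_punkt_languages is a hypothetical helper name, and the directory in the example comment is a guess at a typical nltk_data location.

```python
from pathlib import Path


def installed_punkt_languages(punkt_dir):
    """Return the sorted language names that have a Punkt pickle in punkt_dir."""
    # Each sentence model is assumed to be stored as <language>.pickle.
    return sorted(p.stem for p in Path(punkt_dir).glob("*.pickle"))


# Example (the path is an assumption; adjust it to your nltk_data location):
# installed_punkt_languages(Path.home() / "nltk_data" / "tokenizers" / "punkt")
```

This only covers sentence tokenization. Part-of-speech tagging and noun-phrase extraction depend on separate models, so a language appearing here does not mean the rest of the pipeline supports it.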

Question 2

I'm using NLTK and TextBlob to find nouns and noun phrases in a text:

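from textblob import TextBlob
import nltk

# Not defined in the original snippet; assumed to test for Penn Treebank noun tags (NN, NNS, NNP, NNPS).
def is_noun(pos):
    return pos.startswith('NN')

# text is assumed to hold the input string.
blob = TextBlob(text)
print(blob.noun_phrases)
tokenized = nltk.word_tokenize(text)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print(nouns)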

This works fine if my text is in English, but it no longer gives good results if my text is in French.

I was unable to find out how to adapt this code for French. How do I do that?

And is there a list somewhere of all the languages that are possible to parse?
