Extracting Words using nltk from German Text

red · Feb 5, 2012 · Viewed 15.8k times

I am trying to extract words from a German document. When I use the following method, as described in the nltk tutorial, I fail to get the words containing language-specific special characters.

import nltk

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))

What should I do to get the list of words in the document?

An example with nltk.tokenize.WordPunctTokenizer() on the German phrase Veränderungen über einen Walzer looks like this:

In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")

Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']

In this example "ä" is treated as a delimiter, while "ü" is not.

Answer

alexis · Feb 6, 2012

Call PlaintextCorpusReader with the parameter encoding='utf-8':

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
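
To check that it worked, inspect a few tokens; each should now be a unicode object rather than raw bytes. A minimal sketch (the directory 'Corpus' and the file name 'german.txt' are placeholders for your own paths):

import nltk

ptcr = nltk.corpus.PlaintextCorpusReader('Corpus', '.*', encoding='utf-8')
for tok in ptcr.words('german.txt')[:5]:
    print repr(tok)   # e.g. u'Ver\xe4nderungen' -- umlauts intact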

Edit: I see... you have two separate problems here:

a) Tokenization problem: When you test with a German string literal, you think you are entering unicode. In fact, you are telling Python to take the bytes between the quotes and convert them into a unicode string, and those bytes are being misinterpreted. Fix: Add the following line at the very top of your source file.

# -*- coding: utf-8 -*-

All of a sudden your constants will be seen and tokenized correctly:

german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
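
With the coding declaration in place, the tokenizer sees real unicode code points and the umlauts survive; the output should look like:

[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']

If you can't control the source encoding (say, the bytes arrive from a file or a socket), the equivalent fix is to decode the bytes explicitly before tokenizing. A sketch, assuming the input really is UTF-8:

import nltk

raw = 'Ver\xc3\xa4nderungen \xc3\xbcber einen Walzer'  # raw UTF-8 bytes
german = raw.decode('utf-8')                           # now a proper unicode string
print nltk.tokenize.WordPunctTokenizer().tokenize(german)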

b) nltk.Text problem: It turns out that Text() does not use unicode! If you pass it a unicode string, it will try to convert it to a pure-ASCII string, which of course fails on non-ASCII input. Ugh.

Solution: My recommendation is to avoid using nltk.Text entirely and to work with the corpus readers directly. (This is in general a good idea: see nltk.Text's own documentation.)
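
Most of what nltk.Text offers is available without it. For instance, frequency counts work directly on the token list the reader returns. A sketch ('german.txt' is again a placeholder file name):

import nltk

ptcr = nltk.corpus.PlaintextCorpusReader('Corpus', '.*', encoding='utf-8')
fd = nltk.FreqDist(ptcr.words('german.txt'))   # FreqDist handles unicode fine
for word in fd.keys()[:10]:                    # in NLTK 2.x, keys() is sorted by decreasing frequency
    print word.encode('utf-8'), fd[word]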

But if you must use nltk.Text with German data, here's how: Read your data properly so it can be tokenized, but then "encode" your unicode back to a list of str. For German, it's probably safest to just use the Latin-1 encoding, but utf-8 seems to work too.

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

# Convert unicode tokens back to utf-8 encoded str
coded = [tok.encode('utf-8') for tok in ptcr.words(DocumentName)]
words = nltk.Text(coded)
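
With the tokens encoded back to str, the usual Text conveniences work again; anything they print will be raw UTF-8 bytes, which most terminals render correctly. For example ('Walzer' is just an illustrative query):

words.concordance('Walzer')
print words.count('einen')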