How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

Question 1

python string unicode nlp cjk

Continuation · Sep 26, 2010 · Viewed 15.9k times · Source

Answer

You can do this, but not with standard library functions, and regular expressions won't help you either.

The task you are describing is part of the field called Natural Language Processing (NLP). A lot of work has already been done on splitting Chinese text at word boundaries. I'd suggest using one of these existing solutions rather than trying to roll your own.

Where does the ambiguity come from?

What you have listed there are Chinese characters. These are roughly analogous to letters or syllables in English (but not quite the same, as NullUserException points out in a comment). There is no ambiguity about where the character boundaries are; this is very well defined. But you asked not for character boundaries but for word boundaries, and Chinese words can consist of more than one character.
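To make the character/word distinction concrete, here is a minimal sketch of greedy forward maximum matching, one simple dictionary-based way to segment text. It is only an illustration, not how any particular NLP library works: the function name `forward_max_match`, the `max_word_len` parameter, and the four-entry `TOY_DICTIONARY` are all hypothetical, and a real segmenter would rely on a large lexicon and usually on statistical models as well.

```python
# Toy dictionary: a hypothetical stand-in for the large lexicon a real segmenter uses.
TOY_DICTIONARY = {"这", "是", "一个", "句子"}


def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


words = forward_max_match("这是一个句子", TOY_DICTIONARY)
print(words)  # four words, although the sentence has six characters
```

The result depends entirely on what the dictionary contains, which is one place the ambiguity comes from: nothing in the string itself marks where one word ends and the next begins.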

If all you want is to find the characters, then this is very simple and does not require an NLP library. Simply decode the message into a unicode string (if it is not one already), then convert the unicode string to a list with the builtin function list. This will give you a list of the characters in the string. For your specific example:

>>> list(u"这是一个句子")
[u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']
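The console session above is Python 2. In Python 3, where every str is already Unicode, the decode-then-list approach might look like the sketch below. The variable names and the choice of UTF-8 as the input encoding are assumptions made for illustration.

```python
# Raw UTF-8 bytes, as they might arrive from a file or socket (illustrative input).
raw = "这是一个句子".encode("utf-8")

# Decode to str first; iterating over the bytes directly would yield
# individual byte values, not characters.
text = raw.decode("utf-8")
characters = list(text)

print(len(raw))         # number of bytes
print(len(characters))  # number of characters
print(characters)
```

Each of these characters takes three bytes in UTF-8, which is why the decoding step has to come before the split.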

Question 2

I want to split a sentence into a list of words.

For English and European languages this is easy: just use split().

>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']

But I also need to deal with sentences in languages such as Chinese that don't use whitespace as a word separator.

>>> u"这是一个句子".split()
[u'\u8fd9\u662f\u4e00\u4e2a\u53e5\u5b50']

Obviously that doesn't work.

How do I split such a sentence into a list of words?

UPDATE:

So far the answers seem to suggest that this requires natural language processing techniques and that the word boundaries in Chinese are ambiguous. I'm not sure I understand why. The word boundaries in Chinese seem very definite to me. Each Chinese word/character has a corresponding Unicode code point and is displayed on screen as a separate word/character.

So where does the ambiguity come from? As you can see in my Python console output, Python has no problem telling that my example sentence is made up of 6 characters:

这 - u8fd9
是 - u662f
一 - u4e00
个 - u4e2a
句 - u53e5
子 - u5b50

So obviously Python has no problem telling the word/character boundaries. I just need those words/characters in a list.
