Find all Chinese text in a string using Python and Regex

prairiedogg · Apr 27, 2010

I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex. Any suggestions?

Answer

prairiedogg · Apr 27, 2010

Python 2:

#!/usr/bin/env python
# -*- encoding: utf8 -*-


import re

sample = u'I am from 美国。We should be friends. 朋友。'
for n in re.findall(ur'[\u4e00-\u9fff]+',sample):
    print n

Python 3:

import re

sample = 'I am from 美国。We should be friends. 朋友。'
for n in re.findall(r'[\u4e00-\u9fff]+', sample):
    print(n)

Output:

美国
朋友
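
The question asks about stripping the Chinese out rather than listing it. The same character class should work with re.sub for that. The following is a minimal Python 3 sketch; the names CJK_RUN and strip_cjk are made up for illustration and are not part of the answer above.

```python
import re

# Same character class as in the findall example: the CJK Unified Ideographs block.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def strip_cjk(text):
    """Return text with every run of CJK Unified Ideographs removed (hypothetical helper)."""
    return CJK_RUN.sub('', text)


sample = 'I am from 美国。We should be friends. 朋友。'
print(strip_cjk(sample))
```

Note that the ideographic full stop 。 (U+3002) lies outside 4E00-9FFF, so it survives the substitution. If CJK punctuation should go as well, the character class would need to be widened.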

About Unicode code blocks:

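The range U+4E00 to U+9FFF covers the CJK Unified Ideographs block (CJK = Chinese, Japanese and Korean), so the pattern above also matches Japanese kanji and Korean hanja, not only Chinese text. Several lower blocks are related to CJK to some degree, although of these only Extension A contains further ideographs: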

31C0—31EF CJK Strokes
31F0—31FF Katakana Phonetic Extensions
3200—32FF Enclosed CJK Letters and Months
3300—33FF CJK Compatibility
3400—4DBF CJK Unified Ideographs Extension A
4DC0—4DFF Yijing Hexagram Symbols
4E00—9FFF CJK Unified Ideographs
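
If the rarer ideographs in Extension A should be matched too, one option is to add that block to the character class. This is only a sketch under that assumption (Python 3); CJK_IDEOGRAPHS and find_cjk are illustrative names, and the other blocks listed above are deliberately left out because they hold strokes, symbols and compatibility forms rather than ideographs.

```python
import re

# Extension A (3400-4DBF) plus the main CJK Unified Ideographs block (4E00-9FFF).
# The Yijing Hexagram Symbols block (4DC0-4DFF) sits between them and is skipped.
CJK_IDEOGRAPHS = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff]+')


def find_cjk(text):
    """Return all runs of ideographs from the two blocks above (hypothetical helper)."""
    return CJK_IDEOGRAPHS.findall(text)


# \u3400 is the first code point of Extension A.
print(find_cjk('\u3400 and 美国'))
```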