I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex. Any suggestions?
Python 2:
#!/usr/bin/env python
# -*- encoding: utf8 -*-
import re
sample = u'I am from 美国。We should be friends. 朋友。'
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n
Python 3:
import re
sample = 'I am from 美国。We should be friends. 朋友。'
for n in re.findall(r'[\u4e00-\u9fff]+', sample):
    print(n)
Output:
美国
朋友
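The snippets above extract the Chinese runs, while the question asks about stripping them out. A minimal sketch of the removal side, assuming Python 3 and reusing the same character class with re.sub (the names CJK_RUN and strip_cjk are made up for this illustration):

```python
import re

# Same character class as in the answer above: CJK Unified Ideographs only.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def strip_cjk(text):
    """Return text with every run of CJK Unified Ideographs removed."""
    return CJK_RUN.sub('', text)

sample = 'I am from 美国。We should be friends. 朋友。'
print(strip_cjk(sample))
# prints: I am from 。We should be friends. 。
```

Note that the ideographic full stop 。 (U+3002) survives, because it sits in the CJK Symbols and Punctuation block (3000-303F), outside the 4E00-9FFF range. If punctuation should go too, the character class would need to be widened accordingly.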
About Unicode code blocks:
The range U+4E00 to U+9FFF covers CJK Unified Ideographs (CJK = Chinese, Japanese and Korean). There are a number of lower ranges that relate, to some degree, to CJK:
31C0—31EF CJK Strokes
31F0—31FF Katakana Phonetic Extensions
3200—32FF Enclosed CJK Letters and Months
3300—33FF CJK Compatibility
3400—4DBF CJK Unified Ideographs Extension A
4DC0—4DFF Yijing Hexagram Symbols
4E00—9FFF CJK Unified Ideographs
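If the wider set of blocks is wanted, the list above can be turned into a character class programmatically. This is only a sketch, assuming Python 3; CJK_RELATED_BLOCKS and build_pattern are hypothetical names, and whether every block belongs in the pattern (for example Katakana Phonetic Extensions or Yijing Hexagram Symbols) depends on what counts as "Chinese" for your data:

```python
import re

# (start, end, name) copied from the block list above.
CJK_RELATED_BLOCKS = [
    (0x31C0, 0x31EF, 'CJK Strokes'),
    (0x31F0, 0x31FF, 'Katakana Phonetic Extensions'),
    (0x3200, 0x32FF, 'Enclosed CJK Letters and Months'),
    (0x3300, 0x33FF, 'CJK Compatibility'),
    (0x3400, 0x4DBF, 'CJK Unified Ideographs Extension A'),
    (0x4DC0, 0x4DFF, 'Yijing Hexagram Symbols'),
    (0x4E00, 0x9FFF, 'CJK Unified Ideographs'),
]

def build_pattern(blocks):
    """Compile a regex matching runs of characters from any listed block."""
    # None of these code points is special inside a character class,
    # so the literal characters can be used as range endpoints.
    ranges = ''.join('{}-{}'.format(chr(lo), chr(hi)) for lo, hi, _name in blocks)
    return re.compile('[' + ranges + ']+')

wide = build_pattern(CJK_RELATED_BLOCKS)
sample = 'I am from 美国。We should be friends. 朋友。'
print(wide.findall(sample))
# prints: ['美国', '朋友']
```

The listed blocks happen to be contiguous, so the result is equivalent to the single range [\u31c0-\u9fff]. Keeping them as separate entries just makes it easy to drop a block you do not want.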