Java regex to support Unicode?

cometta · Jun 5, 2012 · Viewed 70.2k times

To match the letters A to Z, we can use the regex:

[A-Za-z]

How can a regex be made to match Unicode characters entered by the user, for example Chinese words like 环保部?

Answer

stema · Jun 5, 2012

What you are looking for are Unicode properties.

For example, \p{L} matches any kind of letter from any language.

So a regex to match such a Chinese word could be something like

\p{L}+

There are many such properties; for more details, see regular-expressions.info.
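As a minimal sketch of the \p{L}+ approach, the class below checks whether a string consists only of letters and pulls the first run of letters out of mixed input. The class and method names (UnicodeLetterDemo, isAllLetters, firstWord) are made up for illustration, and the word 环保部 is written with Unicode escapes (\u73AF\u4FDD\u90E8) so the source compiles regardless of file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from any library.
class UnicodeLetterDemo {
    // \p{L} matches any code point in the Unicode general category "Letter".
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // True only if the whole input is one or more letters (any script).
    static boolean isAllLetters(String input) {
        return LETTERS.matcher(input).matches();
    }

    // Returns the first run of letters found in the input, or null if there is none.
    static String firstWord(String input) {
        Matcher m = LETTERS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String chinese = "\u73AF\u4FDD\u90E8"; // the example word from the question

        System.out.println(isAllLetters(chinese));  // true
        System.out.println(isAllLetters("abc123")); // false: digits are not letters
        System.out.println(firstWord("123 " + chinese + " xyz"));
    }
}
```

Note that [A-Za-z] would reject the Chinese word entirely, whereas \p{L} accepts letters from any script while still rejecting digits and punctuation.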

Another option is to use the flag

Pattern.UNICODE_CHARACTER_CLASS

Java 7 introduced the flag Pattern.UNICODE_CHARACTER_CLASS, which enables the Unicode version of the predefined character classes. See my answer here for some more details and links.

You could do something like this

Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

and \w would match all letters and all digits from any language (and of course some connector characters like _).
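To see what the flag changes, here is a small sketch that compares \w+ compiled with and without Pattern.UNICODE_CHARACTER_CLASS. The class and method names are hypothetical; the embedded form (?U) is the inline equivalent of the flag. It requires Java 7 or later, and the Chinese word is again written as Unicode escapes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class UnicodeWordDemo {
    // Default behaviour: \w is limited to [a-zA-Z_0-9].
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With the flag, \w follows the Unicode definition of a word character.
    private static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Same flag, enabled inline through the embedded expression (?U).
    private static final Pattern UNICODE_WORD_INLINE = Pattern.compile("(?U)\\w+");

    static boolean matchesAscii(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    static boolean matchesUnicodeInline(String s) {
        return UNICODE_WORD_INLINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String chinese = "\u73AF\u4FDD\u90E8"; // the example word from the question

        System.out.println(matchesAscii(chinese));         // false
        System.out.println(matchesUnicode(chinese));       // true
        System.out.println(matchesUnicodeInline(chinese)); // true
        System.out.println(matchesUnicode("abc_123"));     // true
    }
}
```

The flag also affects \d, \s, \b and the POSIX classes, so it is worth checking that the wider definitions are what the rest of the pattern expects.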