Regular expression with Chinese characters and full/half-width characters

BratAnon · Nov 9, 2015 · Viewed 7.5k times

I'm writing validation rules for a Java project, and one of the requirements I got is:

"The ID card address should contain no less than eight (≥8) Chinese characters (exclusive of full-width/half-width symbols)."

I can't get my head around how to solve this.

I have come to the point where I can validate for Chinese characters, but I am not able to exclude all the full-width/half-width symbols.

return Pattern.matches("^[\\p{IsHan}]{8,}$", address);

Results should be something like

  • 名字名字名字名字 = true
  • 名字名字名字名(字)= true
  • 名字名字名(字) = false
  • 名字名字名(字)= false

Does anyone have any advice?

Answer

nhahtdh · Nov 10, 2015

Assuming that you want to check that there are 8 or more Chinese characters in the string:

Pattern.compile("^(\\P{sc=Han}*\\p{sc=Han}){8}.*$", Pattern.DOTALL);
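
As a rough usage sketch, the pattern can be compiled once and run against the four examples from the question. The class name HanAddressCheck and the method hasAtLeastEightHan are made up for this illustration. The sample strings are written with \u escapes so the file compiles regardless of source encoding: \u540D\u5B57 is 名字, and \uFF08 and \uFF09 are the full-width parentheses.

```java
import java.util.regex.Pattern;

class HanAddressCheck {
    // Eight repetitions of "any run of non-Han characters, then one Han character",
    // followed by anything. DOTALL lets '.' also match line terminators.
    private static final Pattern AT_LEAST_EIGHT_HAN =
            Pattern.compile("^(\\P{sc=Han}*\\p{sc=Han}){8}.*$", Pattern.DOTALL);

    // Hypothetical helper name, chosen for this example only.
    static boolean hasAtLeastEightHan(String address) {
        return AT_LEAST_EIGHT_HAN.matcher(address).matches();
    }

    public static void main(String[] args) {
        String mz = "\u540D\u5B57"; // the two-character word from the question
        String[] samples = {
            mz + mz + mz + mz,                          // 8 Han characters
            mz + mz + mz + "\u540D\uFF08\u5B57\uFF09",  // 8 Han, full-width parentheses
            mz + mz + "\u540D(\u5B57)",                 // 6 Han, half-width parentheses
            mz + mz + "\u540D\uFF08\u5B57\uFF09"        // 6 Han, full-width parentheses
        };
        for (String s : samples) {
            System.out.println(s + " = " + hasAtLeastEightHan(s));
        }
    }
}
```

Because the symbols are simply skipped by the \P{sc=Han}* part, it does not matter whether the parentheses are full-width or half-width; only the number of Han characters decides the result.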

Since it's unclear what you consider a Chinese character, I'm using the Han script as an approximation. According to Unicode 6.2.0, the Han script is defined to contain the following code points:

2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2EF3    ; Han # So  [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5    ; Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
3005          ; Han # Lm       IDEOGRAPHIC ITERATION MARK
3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO
3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A    ; Han # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
3400..4DB5    ; Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FCC    ; Han # Lo [20941] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FCC
F900..FA6D    ; Han # Lo [366] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA6D
FA70..FAD9    ; Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
20000..2A6D6  ; Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
2A700..2B734  ; Han # Lo [4149] CJK UNIFIED IDEOGRAPH-2A700..CJK UNIFIED IDEOGRAPH-2B734
2B740..2B81D  ; Han # Lo [222] CJK UNIFIED IDEOGRAPH-2B740..CJK UNIFIED IDEOGRAPH-2B81D
2F800..2FA1D  ; Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D
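
If the rule is easier to express as a count than as a regular expression, the same Han-script approximation is available through Character.UnicodeScript. This is an alternative to the pattern rather than part of the answer; the names HanCounter and countHan are illustrative only. The sample string uses \u escapes and stands for 名字名字名(字) with full-width parentheses.

```java
class HanCounter {
    // Counts code points whose Unicode script is Han. Iterating by code point
    // (not by char) keeps supplementary ideographs such as U+20000 in one piece.
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) {
        // Five Han characters, then one more inside full-width parentheses.
        String sample = "\u540D\u5B57\u540D\u5B57\u540D\uFF08\u5B57\uFF09";
        long han = countHan(sample);
        // 6 Han code points, so the "at least eight" rule is not met: prints false.
        System.out.println(han >= 8);
    }
}
```

Counting this way also makes it easy to report how many characters are missing, which a plain match/no-match result does not give you.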

Java 8 uses Unicode 6.2.0, so \p{sc=Han} matches the code points listed above. However, the implementation also matches unassigned code points (in assigned blocks) and whole unassigned blocks, so remember to upgrade the JRE to the latest major version so that the program keeps working correctly as more characters are added to Unicode.

In particular, \p{sc=Han} in Oracle's implementation includes these ranges:

  • U+2E80 - U+2FEF: CJK Radicals Supplement (whole block), Kangxi Radicals (whole block) and 16 code points from an unassigned block.
  • U+3005, U+3007, U+3021 - U+3029, U+3038 - U+303B: CJK Symbols and Punctuation (some characters in the block)
  • U+3400 - U+4DBF: CJK Unified Ideographs Extension A (whole block)
  • U+4E00 - U+9FFF: CJK Unified Ideographs (whole block)
  • U+F900 - U+FAFF: CJK Compatibility Ideographs (whole block)
  • U+20000 - U+E0000: CJK Unified Ideographs Extension B/C/D/E (whole blocks), CJK Compatibility Ideographs Supplement (whole block), and several unassigned Unicode planes, plus one reserved code point in the Tags block.
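
Since these ranges depend on the JRE in use, it may be worth probing a few boundary code points on your own runtime rather than relying on the list. The following is only a sketch: HanScriptProbe and isHan are hypothetical names, and no expected output is given for the boundary cases because it can differ between Java versions.

```java
import java.util.regex.Pattern;

class HanScriptProbe {
    private static final Pattern HAN = Pattern.compile("\\p{sc=Han}");

    // True if the single code point matches \p{sc=Han} in the running JRE.
    static boolean isHan(int codePoint) {
        return HAN.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int[] probes = {
            0x4E00,  // CJK UNIFIED IDEOGRAPH-4E00, assigned
            0x2FE0,  // in the unassigned gap after the Kangxi Radicals block
            0x9FFF,  // last code point of the CJK Unified Ideographs block
            0x2B820, // just past 2B740..2B81D in the Unicode 6.2.0 list
            0xFF08   // FULLWIDTH LEFT PARENTHESIS, not Han
        };
        String version = System.getProperty("java.version");
        for (int cp : probes) {
            System.out.printf("U+%04X -> %b (Java %s)%n", cp, isHan(cp), version);
        }
    }
}
```

Running this on each JRE you deploy to shows directly which of the unassigned or later-assigned code points your validation would accept.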