I'm doing validation rules for a java project and one of the requirements I got is:
"The ID card address should contain no less than eight (≥8) Chinese characters (exclusive of full-width/half-width symbols)."
I can't get my head around how to solve this.
I have come to the point where I can validate for Chinese characters but are not able to exclude all the full-width/half-width symbols.
return Pattern.matches("^[\\p{IsHan}]{8,}$", address);
Results should be something like
Does anyone have any advice?
Assuming that you want to check that there are 8 or more Chinese characters in the string:
Pattern.compile("^(\\P{sc=Han}*\\p{sc=Han}){8}.*$", Pattern.DOTALL);
Since it's unclear what you consider Chinese character, I'm using Han script as an approximation. According to Unicode 6.2.0, Han script is defined to contain the following code points:
2E80..2E99 ; Han # So [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2EF3 ; Han # So [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5 ; Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
3005 ; Han # Lm IDEOGRAPHIC ITERATION MARK
3007 ; Han # Nl IDEOGRAPHIC NUMBER ZERO
3021..3029 ; Han # Nl [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A ; Han # Nl [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
303B ; Han # Lm VERTICAL IDEOGRAPHIC ITERATION MARK
3400..4DB5 ; Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FCC ; Han # Lo [20941] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FCC
F900..FA6D ; Han # Lo [366] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA6D
FA70..FAD9 ; Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
20000..2A6D6 ; Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
2A700..2B734 ; Han # Lo [4149] CJK UNIFIED IDEOGRAPH-2A700..CJK UNIFIED IDEOGRAPH-2B734
2B740..2B81D ; Han # Lo [222] CJK UNIFIED IDEOGRAPH-2B740..CJK UNIFIED IDEOGRAPH-2B81D
2F800..2FA1D ; Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D
Java 8 is using Unicode 6.2.0, so \p{sc=Han}
matches the code points listed above. However, the implementation also includes unassigned code points (in assigned blocks) and unassigned blocks, so do take note to upgrade the JRE to the latest major version to make sure the program runs correctly as more characters are added to Unicode.
In particular, \p{sc=Han}
in Oracle's implementation includes these ranges: