Split String using Unicode delimiter

Bhavya · Mar 8, 2012

I need to split a string in Java using "-" as the delimiter. For example: "Single Room - Enjoy your stay"

The same data comes in English and German depending on the locale, so I cannot use the usual string.split("-"). The Unicode code point for the "-" character is 8212 (decimal) or 0x2014 (hex). How do I split the string using the Unicode value?

Answer

tchrist · Mar 8, 2012

You may be mistaken in which Unicode dash character you’re getting. As of Unicode v6.1, there are 27 code points that have the \p{Dash} property:

U+002D ‭ -  HYPHEN-MINUS
U+058A ‭ ֊  ARMENIAN HYPHEN
U+05BE ‭ ־  HEBREW PUNCTUATION MAQAF
U+1400 ‭ ᐀  CANADIAN SYLLABICS HYPHEN
U+1806 ‭ ᠆  MONGOLIAN TODO SOFT HYPHEN
U+2010 ‭ ‐  HYPHEN
U+2011 ‭ ‑  NON-BREAKING HYPHEN
U+2012 ‭ ‒  FIGURE DASH
U+2013 ‭ –  EN DASH
U+2014 ‭ —  EM DASH
U+2015 ‭ ―  HORIZONTAL BAR
U+2053 ‭ ⁓  SWUNG DASH
U+207B ‭ ⁻  SUPERSCRIPT MINUS
U+208B ‭ ₋  SUBSCRIPT MINUS
U+2212 ‭ −  MINUS SIGN
U+2E17 ‭ ⸗  DOUBLE OBLIQUE HYPHEN
U+2E1A ‭ ⸚  HYPHEN WITH DIAERESIS
U+2E3A ‭ ⸺  TWO-EM DASH
U+2E3B ‭ ⸻  THREE-EM DASH
U+301C ‭ 〜 WAVE DASH
U+3030 ‭ 〰 WAVY DASH
U+30A0 ‭ ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31 ‭ ︱ PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ‭ ︲ PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58 ‭ ﹘ SMALL EM DASH
U+FE63 ‭ ﹣ SMALL HYPHEN-MINUS
U+FF0D ‭ － FULLWIDTH HYPHEN-MINUS
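
If you are not sure which of these is actually in your data, one quick check is to dump the non-ASCII code points of a sample string. This is only a minimal sketch: the names DashInspector and describeNonAscii are made up for illustration, and Character.getName needs Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;

public class DashInspector {
    // Hypothetical helper: lists every non-ASCII code point in the input
    // as "U+XXXX NAME", so the real delimiter can be identified.
    static List<String> describeNonAscii(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                out.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            }
            // Advance by one or two chars so surrogate pairs stay intact.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample text from the question, with an EM DASH as the separator.
        for (String line : describeNonAscii("Single Room \u2014 Enjoy your stay")) {
            System.out.println(line);
        }
    }
}
```

Running it over a few real records from each locale should tell you whether you are getting an EM DASH, an EN DASH, or a plain HYPHEN-MINUS.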

In Perl or ICU, you could just split directly on \p{dash}, but since the Sun Pattern class doesn’t support full Unicode properties like that, you have to synthesize it with an enumerated square-bracketed character class. So splitting on the pattern:

string.split("[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]")

should do the trick for you. You can also double the backslashes (\\u2014 rather than \u2014) if you are worried about the Java compiler translating the escapes before the regex ever sees them, because the regex parser understands the \uXXXX notation itself.
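
Putting that together, here is a rough sketch of the double-backslash variant as a precompiled pattern. The class and method names (DashSplitter, split) are hypothetical, and the optional \s* on either side of the dash is my own addition so the pieces come back without the surrounding spaces; drop it if you want the whitespace kept.

```java
import java.util.regex.Pattern;

public class DashSplitter {
    // The 27 Dash code points listed above. The backslashes are doubled so
    // the regex engine, not the Java compiler, decodes the escapes.
    // Assumption: optional whitespace around the dash is swallowed too.
    private static final Pattern DASH = Pattern.compile(
        "\\s*[\\u002D\\u058A\\u05BE\\u1400\\u1806\\u2010-\\u2015\\u2053"
        + "\\u207B\\u208B\\u2212\\u2E17\\u2E1A\\u2E3A\\u2E3B\\u301C\\u3030"
        + "\\u30A0\\uFE31\\uFE32\\uFE58\\uFE63\\uFF0D]\\s*");

    // Hypothetical helper: splits the input on any dash character.
    static String[] split(String input) {
        return DASH.split(input);
    }

    public static void main(String[] args) {
        // Sample text from the question, with an EM DASH as the separator.
        for (String part : split("Single Room \u2014 Enjoy your stay")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

One caveat: since HYPHEN-MINUS is in the class, hyphenated words inside the text will be split as well. If that matters for your data, requiring whitespace on both sides (\\s+ instead of \\s*) is one possible way around it.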