I need to split a string in Java using "-" as the delimiter. Example: "Single Room - Enjoy your stay"
The same data comes in English or German depending on the locale, so I cannot use the usual string.split("-"). The Unicode code point of the "-" character I receive is 8212 (decimal), or 0x2014 (hex). How do I split the string on that Unicode character?
You may be mistaken about which Unicode dash character you’re getting. As of Unicode v6.1, there are 27 code points that have the \p{Dash} property:
U+002D - HYPHEN-MINUS
U+058A ֊ ARMENIAN HYPHEN
U+05BE ־ HEBREW PUNCTUATION MAQAF
U+1400 ᐀ CANADIAN SYLLABICS HYPHEN
U+1806 ᠆ MONGOLIAN TODO SOFT HYPHEN
U+2010 ‐ HYPHEN
U+2011 ‑ NON-BREAKING HYPHEN
U+2012 ‒ FIGURE DASH
U+2013 – EN DASH
U+2014 — EM DASH
U+2015 ― HORIZONTAL BAR
U+2053 ⁓ SWUNG DASH
U+207B ⁻ SUPERSCRIPT MINUS
U+208B ₋ SUBSCRIPT MINUS
U+2212 − MINUS SIGN
U+2E17 ⸗ DOUBLE OBLIQUE HYPHEN
U+2E1A ⸚ HYPHEN WITH DIAERESIS
U+2E3A ⸺ TWO-EM DASH
U+2E3B ⸻ THREE-EM DASH
U+301C 〜 WAVE DASH
U+3030 〰 WAVY DASH
U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31 ︱ PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ︲ PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58 ﹘ SMALL EM DASH
U+FE63 ﹣ SMALL HYPHEN-MINUS
U+FF0D － FULLWIDTH HYPHEN-MINUS
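To find out which of these your data actually contains, you can dump the code points of the incoming string. This is only a minimal sketch, assuming Java 8 or later; the class and method names (DashInspector, describe) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class DashInspector {

    /** Returns one "U+XXXX NAME" entry per code point of the input (hypothetical helper). */
    public static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        // codePoints() walks full code points, so surrogate pairs are not split in two.
        s.codePoints().forEach(cp ->
            out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Sample input with an em dash written as an escape; replace with your real data.
        for (String line : describe("Room \u2014 Stay")) {
            System.out.println(line);
        }
    }
}
```

Running this on a sample of the real input shows whether the separator is U+2014, U+2013, or the plain U+002D.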
In Perl or ICU, you could just split directly on \p{Dash}, but since Java’s Pattern class doesn’t support full Unicode properties like that, you have to synthesize it with an enumerated square-bracketed character class. So splitting on the pattern:
string.split("[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]")
should do the trick for you. You can also double the backslashes (\\u2014 and so on) if you are worried about the Java compiler’s Unicode-escape processing getting in your way, because the regex parser understands the \uXXXX notation on its own.
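Put together, it could look roughly like the sketch below. The class and method names (DashSplitter, splitOnDash) are hypothetical, and swallowing the whitespace around the dash with \s* is my own assumption about what you want, not something the question asks for. The dashes are listed one by one, with doubled backslashes so that the regex engine rather than the compiler interprets the escapes:

```java
import java.util.regex.Pattern;

public class DashSplitter {

    // The 27 code points with the Dash property listed above (Unicode 6.1).
    // U+2010-U+2015 is the only range; everything else is enumerated individually.
    private static final String DASHES =
        "\\u002D\\u058A\\u05BE\\u1400\\u1806\\u2010-\\u2015\\u2053\\u207B\\u208B"
        + "\\u2212\\u2E17\\u2E1A\\u2E3A\\u2E3B\\u301C\\u3030\\u30A0"
        + "\\uFE31\\uFE32\\uFE58\\uFE63\\uFF0D";

    // Assumption: whitespace around the dash belongs to the delimiter and is dropped.
    private static final Pattern DELIMITER =
        Pattern.compile("\\s*[" + DASHES + "]\\s*");

    /** Splits the input on any dash character (hypothetical helper). */
    public static String[] splitOnDash(String s) {
        return DELIMITER.split(s);
    }

    public static void main(String[] args) {
        // Sample input using an em dash (U+2014) as the separator.
        for (String part : splitOnDash("Single Room \u2014 Enjoy your stay")) {
            System.out.println(part);
        }
    }
}
```

Keep in mind that this also splits on ordinary hyphens inside words such as "check-in". If only the first dash should count, DELIMITER.split(s, 2) limits the result to two parts.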