I need to split a text and keep only words, numbers and hyphenated compound words. I also need to handle Latin words with accented letters, so I used \p{L}, which gives me é, ú, ü, ã, and so forth. The example is:
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";
Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String[] words = pattern.split(myText);
What is wrong with this regex? Why does it match symbols like "(", "+", "-", "*" and "|"?
Some of the results are:
dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
The regex explanation is:
[^\p{L}+(\-\p{L}+)*\d]+
* Word separator will be:
* [^ ... ] No sequence in:
* \p{L}+ Any latin letter
* (\-\p{L}+)* Optionally hyphenated
* \d or numbers
* [ ... ]+ once or more.
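A small check of how the character class in that pattern is actually read may help here. Inside [...], the characters +, (, ) and * are literals rather than quantifiers or grouping, so the negated class treats them as part of a word instead of as separators. A minimal sketch follows; the class and method names are made up for illustration, and the input is a shortened piece of the sample text:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CharClassDemo {
    // The pattern from the question. Inside [^...] it simply lists characters
    // that are NOT separators: letters, '+', '(', '-', ')', '*' and digits.
    static String[] splitWithOriginalPattern(String input) {
        Pattern separator = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
        return separator.split(input);
    }

    public static void main(String[] args) {
        String[] parts = splitWithOriginalPattern("occident) : ! (A la * sud-est");
        // prints [occident), (A, la, *, sud-est]
        System.out.println(Arrays.toString(parts));
    }
}
```

Spaces, ":" and "!" act as separators, while ")", "(" and "*" survive in the output because the class lists them alongside the letters.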
If my understanding of your requirement is correct, this regex will match what you want:
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
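It will match a run of Latin-script letters, optionally followed by hyphen-joined runs of Latin-script letters (such as sud-est), or a run of digits.
\p{L} will match a letter in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
The regex above is meant to be passed to Pattern.compile; then call matcher(String input) to obtain a Matcher object, and use a loop to find the matches.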
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);
while (matcher.find()) {
System.out.println(matcher.group());
}
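For reference, here is the same loop as a self-contained sketch that collects the matches into a list. The class name WordExtractor, the method extract and the shortened sample sentence are illustrative choices, not part of the original code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Latin-script words, optionally hyphenated, or runs of digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");

    // Collects every match instead of splitting on separators.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "? 987 (A la pointe sud-est de l'île) * + () e -";
        // prints [987, A, la, pointe, sud-est, de, l, île, e]
        System.out.println(extract(text));
    }
}
```

Note that l'île comes out as two tokens (l and île), since the apostrophe is not part of this pattern, and that the stray symbols are skipped entirely.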
If you want to allow words with an apostrophe ('):
"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"
I also escape - in the character class ['\\-] just in case you want to add more characters later. Strictly speaking, - doesn't need escaping if it is the first or last character in the character class, but I escape it anyway just to be safe.
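As a quick check of both points, the sketch below runs the apostrophe variant with the hyphen escaped and with it left unescaped at the end of the class. The class name ApostropheWords, the helper extract and the sample sentence are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApostropheWords {
    // Returns all matches of the given regex in the input, in order.
    static List<String> extract(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "l'une des plus grandes, d'occident : Notre-Dame 1330";
        // Hyphen escaped inside the character class.
        String escaped = "\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+";
        // Hyphen unescaped but placed last, so it is still a literal.
        String unescaped = "\\p{IsLatin}+(?:['-]\\p{IsLatin}+)*|\\d+";
        // Both print [l'une, des, plus, grandes, d'occident, Notre-Dame, 1330]
        System.out.println(extract(escaped, text));
        System.out.println(extract(unescaped, text));
    }
}
```

The escaped form stays correct if more characters are later appended to the class, which is the reason for preferring it here.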