Regex and escaped and unescaped delimiter

lstipakov · Oct 26, 2011

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d";

I need to split it on semicolons according to the following rules:

  1. If a semicolon is preceded by a backslash, it should not be treated as a separator (as between a and b).

  2. If the backslash is itself escaped, and therefore does not escape the semicolon, that semicolon should be treated as a separator (as between b and c).

So a semicolon should be treated as a separator if it is preceded by zero or an even number of backslashes.

For the example above, I want to get the following strings (shown as raw text; in Java source each backslash would be doubled):

a\;b\\
c
d

Answer

Tim Pietzcker · Oct 26, 2011

You can use the regex

(?:\\.|[^;\\]++)*

to match all text between unescaped semicolons:

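import java.util.*;
import java.util.regex.*;

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
// subjectString is the input to split
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // The pattern can also match the empty string (at each semicolon and at
    // the end of input); skip those matches. This drops empty fields as well.
    if (!regexMatcher.group().isEmpty()) {
        matchList.add(regexMatcher.group());
    }
}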

Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.
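
To see the pattern at work on the string from the question, here is a minimal, self-contained sketch. The class name EscapedSplitDemo and the method fields are made up for illustration, and skipping zero-length matches is an assumption about what the caller wants rather than something the question specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class EscapedSplitDemo {
    // Same pattern as above: escaped pairs or runs of ordinary characters.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // The pattern can match the empty string, so find() also reports
            // zero-length matches at each semicolon and at the end of input.
            // Assumption: empty fields are not wanted, so they are skipped.
            if (m.end() > m.start()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\\;b\\\\;c;d"; // raw text: a\;b\\;c;d
        for (String field : fields(s)) {
            System.out.println(field);
        }
    }
}
```

Run on the question's input, this should print the three strings the question asks for, one per line. If empty fields matter (for example in "a;;b"), the zero-length check would need to be more selective than this sketch.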

The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
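
The rule from the question (a semicolon separates only when it is preceded by zero or an even number of backslashes) can also be cross-checked without a regex. The following is a sketch of a plain character scanner, not the approach of this answer; the class name EscapeAwareSplitter and its split method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a regex-free scanner that tracks escape state.
final class EscapeAwareSplitter {
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true right after an unescaped backslash
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                current.append(c);   // escaped character is kept literally
                escaped = false;
            } else if (c == '\\') {
                current.append(c);   // keep the backslash; next char is escaped
                escaped = true;
            } else if (c == ';') {
                parts.add(current.toString()); // unescaped semicolon ends a field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // last field (may be empty)
        return parts;
    }
}
```

Because each backslash toggles the escape state, a run of an even number of backslashes leaves the following semicolon unescaped, which is the rule stated in the question. Unlike a loop that skips zero-length regex matches, this version keeps empty fields, and it leaves the escape sequences in the output untouched.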