Unclosed character class near index nnn

javafueled · Nov 14, 2011 · Viewed 25.9k times

I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for textile4j, a simple, not quite feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. (The original code was written as a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.)

Unfortunately, the Textile regular expressions (while they work in the context of preg_replace_callback in PHP) fail in Java with the following exception:

java.util.regex.PatternSyntaxException: Unclosed character class near index 217

The statement is obvious; the solution is elusive.

Here's the raw, multiline regex from the PHP implementation:

return preg_replace_callback('/
    (^|(?<=[\s>.\(])|[{[]) # $pre
    "                      # start
    (' . $this->c . ')     # $atts
    ([^"]+?)               # $text
    (?:\(([^)]+?)\)(?="))? # $title
    ":
    ('.$this->urlch.'+?)   # $url
    (\/)?                  # $slash
    ([^\w\/;]*?)           # $post
    ([\]}]|(?=\s|$|\)))
    /x',callback,input);

Cleverly, I got the textile class to "show me the code" being used in this regex with a simple echo that resulted in the following, rather long, regular expression:

(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))

I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.

I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.

Any ideas?

I'm also curious why PHP doesn't throw a similar error. For example, using RegExr I found one "passive subexpression" that was poorly handled, but changing it didn't fix the Java exception and didn't alter the behavior in PHP. The change is shown below.

In $title, switch the escaped paren:

        (?:\(([^)]+?)\)(?="))? # $title
        ...^
        (?:(\([^)]+?)\)(?="))? # $title
        ....^

Thanks, Tim

edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...

"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"

Answer

Alan Moore · Nov 15, 2011

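@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped, namely the [{[] at the end of $pre. []] or [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ inside a character class opens a nested class (Java supports unions and intersections of classes), so [{[] is never closed and the pattern fails to compile.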
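As a minimal sketch of that difference, the following tries the [{[] class from the $pre group of the question's regex with and without the escape. The class and method names are made up for illustration and are not part of textile4j.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of textile4j.
class NestedBracketDemo {

    // Returns true if Java's regex engine accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Print the engine's own description of the problem.
            System.out.println(regex + " -> " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaped '[' starts a nested class that is never closed.
        System.out.println(compiles("[{[]"));
        // Escaped '[' is a literal, so the class closes normally.
        System.out.println(compiles("[{\\[]"));
        // ']' directly after '[^' is taken as a literal.
        System.out.println(compiles("[^]]+"));
    }
}
```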

Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:
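To illustrate why those backslashes can go, here is a small sketch (hypothetical class and helper names) that checks that an escaped and an unescaped form accept the same input in Java. It only spot-checks a few fragments taken from the regex; it is not a proof of equivalence.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for comparing escaped and unescaped fragments.
class RedundantEscapeDemo {

    // True if both patterns match the whole input.
    static boolean bothMatch(String escaped, String plain, String input) {
        return Pattern.compile(escaped).matcher(input).matches()
            && Pattern.compile(plain).matcher(input).matches();
    }

    public static void main(String[] args) {
        // '/' is only special in PHP because it is the delimiter there.
        System.out.println(bothMatch("\\/", "/", "/"));
        // '<', '>' and '=' have no special meaning outside a group construct.
        System.out.println(bothMatch("\\<\\>\\=", "<>=", "<>="));
        // A hyphen placed last in a class is a literal, so it needs no escape.
        System.out.println(bothMatch("[\\w\\-_]+", "[\\w_-]+", "a-b_c"));
    }
}
```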

"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"

Whether it's the best regex I have no idea, not knowing how it's being used.
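For anyone who wants to try it, here is a rough sketch that compiles the regex above and runs it against one simple Textile link. The sample input, the class name, and the firstLink helper are assumptions for illustration; the group numbers follow the $pre, $atts, $text, $title, $url order of the PHP source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern string is copied from the answer above.
class TextileLinkDemo {

    static final Pattern LINK = Pattern.compile(
        "(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))");

    // Returns {text, url} for the first link found, or null if there is none.
    static String[] firstLink(String input) {
        Matcher m = LINK.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Group 3 is $text and group 5 is $url.
        return new String[] { m.group(3), m.group(5) };
    }

    public static void main(String[] args) {
        // Made-up sample input: a bare Textile link with no attributes or title.
        String[] link = firstLink("\"link text\":http://example.com");
        if (link == null) {
            System.out.println("no link found");
        } else {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```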