I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for a simple, not quite feature complete Java implementation, textile4j, that I'm porting to github and syncing to Maven central (the original code was written to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central).
Unfortunately, the textile regex expressions (while they work in context of preg_replace_callback
in PHP) fail in Java with the following exception:
java.util.regex.PatternSyntaxException: Unclosed character class near index 217
The statement is obvious, the solution is elusive.
Here's the raw, multiline regex from the PHP implementation:
return preg_replace_callback('/
(^|(?<=[\s>.\(])|[{[]) # $pre
" # start
(' . $this->c . ') # $atts
([^"]+?) # $text
(?:\(([^)]+?)\)(?="))? # $title
":
('.$this->urlch.'+?) # $url
(\/)? # $slash
([^\w\/;]*?) # $post
([\]}]|(?=\s|$|\)))
/x',callback,input);
Cleverly, I got the textile class to "show me the code" being used in this regex with a simple echo
that resulted in the following, rather long, regular expression:
(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))
I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.
I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.
Any ideas?
I'm also curious why PHP doesn't throw a similar error, for example, I found one "passive subexpression" poorly handled using the RegExr, but it didn't fix the Java exception and didn't alter behavior in PHP, shown below.
In #title
switch the escaped paren:
(?:\(([^)]+?)\)(?="))? # $title
...^
(?:(\([^)]+?)\)(?="))? # $title
....^
Thanks, Tim
edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...
"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"
@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []]
or [^]]
are okay because the ]
is the first character other than the negating ^
, but in Java an unescaped [
anywhere in a character class is a syntax error.
Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes /
because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:
"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"
Whether it's the best regex I have no idea, not knowing how it's being used.