I am trying to pick out all tokens in a text and need to match all ASCII and Unicode characters, so here is how I have laid out the fragments:
fragment CHAR : ('A'..'Z') | ('a'..'z');
fragment DIGIT : ('0'..'9');
fragment UNICODE : '\u0000'..'\u00FF';
Now if I write my token rule as:
TOKEN : (CHAR|DIGIT|UNICODE)+;
I get:

"Decision can match input such as 'A'..'Z' using multiple alternatives: 1, 3. As a result, alternative(s) 3 were disabled for that input"

"Decision can match input such as '0'..'9' using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input"
and nothing gets matched. Also, if I write it as
TOKEN : (UNICODE)+;
nothing gets matched either.
Is there a way of doing this?
One other thing to consider if you are planning on using Unicode: you should set the charVocabulary
option to say that you want to allow any character in the Unicode range of 0 through FFFE:
options
{
charVocabulary='\u0000'..'\uFFFE';
}
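For context, the charVocabulary option belongs in the options block at the top of the lexer grammar. A minimal sketch in ANTLR 2 syntax (the class name MyLexer is hypothetical):

class MyLexer extends Lexer;
options
{
    charVocabulary = '\u0000'..'\uFFFE';
}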
The default you'll usually see in the examples is
options
{
charVocabulary = '\3'..'\377';
}
To cover the point made above: generally, if you need both the ASCII character range 'A'..'Z'
and the Unicode range, you'd make the Unicode lexer rule start above the ASCII range so the two don't overlap, like:
'\u0080'..'\ufffe'
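Putting that together with the fragments from the question, the rules might look like the following sketch; because UNICODE now starts at '\u0080', it no longer overlaps CHAR or DIGIT, so the "multiple alternatives" warnings go away:

fragment CHAR    : ('A'..'Z') | ('a'..'z');
fragment DIGIT   : ('0'..'9');
// starts above the ASCII range, so it cannot overlap CHAR or DIGIT
fragment UNICODE : '\u0080'..'\uFFFE';
TOKEN : (CHAR | DIGIT | UNICODE)+;

Note that this UNICODE range deliberately excludes ASCII punctuation and control characters; add further fragments if you need those matched as well.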