My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error. ";" or "45;" (without newlines) result in EarlyExitException. "NL" by itself parses without error. "456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";"
or "45;"
, the token STMTEND
will never be created.
";"
will create a single token: SEMICOLON
, and "45;"
will produce: INT SEMICOLON
.
What you (probably) want is that SEMICOLON
and NEWLINE
never make it to real tokens themselves, but they will always be a STMTEND
. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT
or STMTEND
tokens.