Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

chrysanhy picture chrysanhy · Apr 17, 2011 · Viewed 7.1k times · Source

My question is in regards to running the following grammar in ANTLRWorks:

INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);

statement
    : STMTEND
    | INT STMTEND
    ;

program: statement+;

I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:

"; NL" or "32; NL" parses without error. ";" or "45;" (without newlines) result in EarlyExitException. "NL" by itself parses without error. "456 NL", without the semicolon, results in MismatchedTokenException.

What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.

So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!

Answer

Bart Kiers picture Bart Kiers · Apr 18, 2011

In case of input like ";" or "45;", the token STMTEND will never be created.

";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.

What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:

program: statement+;

statement
 : STMTEND
 | INT STMTEND
 ;

INT     : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;

fragment SEMICOLON : ';';
fragment NEWLINE   : '\r' '\n' | '\n' | '\r';

Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.