Antlr rule priorities

probably at the beach picture probably at the beach · Feb 4, 2011 · Viewed 7.6k times · Source

Firstly I know this grammar doesn't make sense but it was created to test out the ANTLR rule priority behaviour

grammar test;

options 
{

output=AST;
backtrack=true;
memoize=true;

}

rule_list_in_order :
    (
    first_rule
    | second_rule
    | any_left_over_tokens)+
    ;


first_rule
    :
     FIRST_TOKEN
    ;


second_rule:     
    FIRST_TOKEN NEW_LINE SECOND_TOKEN NEW_LINE;


any_left_over_tokens
    :
    NEW_LINE
    | FIRST_TOKEN
    | SECOND_TOKEN;



FIRST_TOKEN
    : 'First token here'
    ;   

SECOND_TOKEN
    : 'Second token here';

NEW_LINE
    : ('\r'?'\n')   ;

WS  : (' '|'\t'|'\u000C')
    {$channel=HIDDEN;}
    ;

When I give this grammar the input 'First token here\nSecond token here', it matches the second_rule.

I would have expected it to match the first rule then any_left_over_tokens because the first_rule appears before the second_rule in the rule_order_list which is the start point. Can anyone explain why this happens?

Cheers

Answer

Bart Kiers picture Bart Kiers · Feb 4, 2011

First of all, ANTLR's lexer will tokenize the input from top to bottom. So tokens defined first have a higher precedence than the ones below it. And in case rule have overlapping tokens, the rule that matches the most characters will take precedence (greedy match).

The same principle holds within parser rules. Rules defined first will also be matched first. For example, in rule foo, sub-rule a will first be tried before b:

foo
  :  a
  |  b
  ;

Note that in your case, the 2nd rule isn't matched, but tries to do so, and fails because there is no trailing line break, producing the error:

line 0:-1 mismatched input '<EOF>' expecting NEW_LINE

So, nothing is matched at all. But that is odd. Because you've set the backtrack=true, it should at least backtrack and match:

  1. first_rule ("First token here")
  2. any_left_over_tokens ("line-break")
  3. any_left_over_tokens ("Second token here")

if not match first_rule in the first place and not even try to match second_rule to begin with.

A quick demo when doing the predicates manually (and disabling the backtrack in the options { ... } section) would look like:

grammar T;

options {
  output=AST;
  //backtrack=true;
  memoize=true;
}

rule_list_in_order
  :  ( (first_rule)=>  first_rule  {System.out.println("first_rule=[" + $first_rule.text + "]");}
     | (second_rule)=> second_rule {System.out.println("second_rule=[" + $second_rule.text + "]");}
     | any_left_over_tokens        {System.out.println("any_left_over_tokens=[" + $any_left_over_tokens.text + "]");}
     )+ 
  ;

first_rule
  :  FIRST_TOKEN
  ;

second_rule
  :  FIRST_TOKEN NEW_LINE SECOND_TOKEN NEW_LINE
  ;

any_left_over_tokens
  :  NEW_LINE
  |  FIRST_TOKEN
  |  SECOND_TOKEN
  ;

FIRST_TOKEN  : 'First token here';   
SECOND_TOKEN : 'Second token here';
NEW_LINE     : ('\r'?'\n');
WS           : (' '|'\t'|'\u000C') {$channel=HIDDEN;};

which can be tested with the class:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = "First token here\nSecond token here";
        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);
        parser.rule_list_in_order();
    }
}

which produces the expected output:

first_rule=[First token here]
any_left_over_tokens=[
]
any_left_over_tokens=[Second token here]

Note that it doesn't matter if you use:

rule_list_in_order
  :  ( (first_rule)=>  first_rule 
     | (second_rule)=> second_rule
     | any_left_over_tokens
     )+ 
  ;

or

rule_list_in_order
  :  ( (second_rule)=> second_rule // <--+--- swapped
     | (first_rule)=>  first_rule  // <-/
     | any_left_over_tokens
     )+ 
  ;

, both will produce the expected output.

So, my guess is that you may have found a bug.

Yout could try the ANTLR mailing-list, in case you want a definitive answer (Terence Parr frequents there more often than he does here).

Good luck!

PS. I tested this with ANTLR v3.2