Once grammar is complete, what's the best way to walk an ANTLR v4 tree?

Nucleon picture Nucleon · Feb 24, 2013 · Viewed 20.1k times · Source

Goal

I'm working on a project to create a Varscoper for Coldfusion CFscript. Basically, this means checking through source code files to ensure that developers have properly var'd their variables.

After a couple of days of working with ANTLR V4 I have a grammar which generates a very nice parse tree in the GUI view. Now, using that tree I need a way to crawl up and down the nodes programmatically looking for variable declarations and ensure that if they are inside functions they have the proper scoping. If possible I would rather NOT do this in the grammar file as that would require mixing the definition of the language with this specific task.

What I've tried

My latest attempt was using the ParserRuleContext and attempting to go through it's children via getPayload(). After checking the class of getPayLoad() I would either have a ParserRuleContext object or a Token object. Unfortunately, using that I was never able to find a way to get the actual rule type for a specific node, only it's containing text. The rule type for each node is neccessary because it matters whether that text node is an ignored right-hand expression, a variable assignment or a function declaration.

Questions

  1. I am very new to ANTLR, is this even the right approach, or is there a better way to traverse the tree?

Here's my sample java code:

Cfscript.java

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.Trees;

public class Cfscript {
    public static void main(String[] args) throws Exception {
        ANTLRInputStream input = new ANTLRFileStream(args[0]);
        CfscriptLexer lexer = new CfscriptLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CfscriptParser parser = new CfscriptParser(tokens);
        parser.setBuildParseTree(true);
        ParserRuleContext tree = parser.component();
        tree.inspect(parser); // show in gui
        /*
            Recursively go though tree finding function declarations and ensuring all variableDeclarations are varred
            but how?
        */
    }
}

Cfscript.g4

grammar Cfscript;

component
    : 'component' keyValue* '{' componentBody '}'
    ;

componentBody
    : (componentElement)*
    ;

componentElement
    : statement
    | functionDeclaration
    ;

functionDeclaration
    : Identifier? Identifier? 'function' Identifier argumentsDefinition '{' functionBody '}'
    ;

argumentsDefinition
    : '(' argumentDefinition (',' argumentDefinition)* ')'
    | '()'
    ;

argumentDefinition
    : Identifier? Identifier? argumentName ('=' expression)?
    ; 

argumentName
    : Identifier
    ;

functionBody
    : (statement)*
    ;

statement
    : variableStatement
    | nonVarVariableStatement
    | expressionStatement
    ;

variableStatement
    : 'var' variableName '=' expression ';'
    ;

nonVarVariableStatement
    : variableName '=' expression ';'
    ;

expressionStatement
    : expression ';'
    ;

expression
    : assignmentExpression
    | arrayLiteral
    | objectLiteral
    | StringLiteral
    | incrementExpression
    | decrementExpression
    | 'true' 
    | 'false'
    | Identifier
    ;

incrementExpression
    : variableName '++'
    ;

decrementExpression
    : variableName '--'
    ;

assignmentExpression
    : Identifier (assignmentExpressionSuffix)*
    | assignmentExpression (('+'|'-'|'/'|'*') assignmentExpression)+
    ;

assignmentExpressionSuffix
    : '.' assignmentExpression
    | ArrayIndex
    | ('()' | '(' expression (',' expression)* ')' )
    ;

methodCall
    : Identifier ('()' | '(' expression (',' expression)* ')' )
    ;

variableName
    : Identifier (variableSuffix)*
    ;

variableSuffix
    : ArrayIndex
    | '.' variableName
    ;

arrayLiteral
    : '[' expression (',' expression)* ']'
    ;

objectLiteral
    : '{' (Identifier '=' expression (',' Identifier '=' expression)*)? '}'
    ;

keyValue
    : Identifier '=' StringLiteral
    ;

StringLiteral
    :  '"' (~('\\'|'"'))* '"'
    ;

 ArrayIndex
    : '[' [1-9] [0-9]* ']'
    | '[' StringLiteral ']'
    ;

Identifier
    : [a-zA-Z0-9]+
    ;

WS
    : [ \t\r\n]+ -> skip 
    ;

COMMENT 
    : '/*' .*? '*/'  -> skip
    ;

Test.cfc (testing code file)

component something = "foo" another = "more" persistent = "true" datasource = "#application.env.dsn#" {
    var method = something.foo.test1;
    testing = something.foo[10];
    testingagain = something.foo["this is a test"];
    nuts["testing"]++;
    blah.test().test3["test"]();

    var math = 1 + 2 - blah.test().test4["test"];

    var test = something;
    var testing = somethingelse;
    var testing = { 
        test = more, 
        mystuff = { 
            interior = test 
        },
        third = "third key"
    };
    other = "Idunno homie";
    methodCall(interiorMethod());

    public function bar() {
        var new = "somebody i used to know";
        something = [1, 2, 3];
    }

    function nuts(required string test1 = "first", string test = "second", test3 = "third") {

    }

    private boolean function baz() {
        var this = "something else";
    }
}

Answer

Bart Kiers picture Bart Kiers · Feb 24, 2013

I wouldn't walk this manually if I were you. After generating a lexer and parser, ANTLR would also have generated a file called CfscriptBaseListener that has empty methods for all of your parser rules. You can let ANTLR walk your tree and attach a custom tree-listener in which you override only those methods/rules you're interested in.

In your case, you probably want to be notified whenever a new function is created (to create a new scope) and you'll probably be interested in variable assignments (variableStatement and nonVarVariableStatement). Your listener, let's call is VarListener will keep track of all scopes as ANTLR walks the tree.

I did change 1 rule slightly (I added objectLiteralEntry):

objectLiteral
    : '{' (objectLiteralEntry (',' objectLiteralEntry)*)? '}'
    ;

objectLiteralEntry
    : Identifier '=' expression
    ;
    

which makes life easier in the following demo:

VarListener.java

public class VarListener extends CfscriptBaseListener {

    private Stack<Scope> scopes;

    public VarListener() {
        scopes = new Stack<Scope>();
        scopes.push(new Scope(null));
    } 

    @Override
    public void enterVariableStatement(CfscriptParser.VariableStatementContext ctx) {
        String varName = ctx.variableName().getText();
        Scope scope = scopes.peek();
        scope.add(varName);
    }

    @Override
    public void enterNonVarVariableStatement(CfscriptParser.NonVarVariableStatementContext ctx) {
        String varName = ctx.variableName().getText();
        checkVarName(varName);
    }

    @Override
    public void enterObjectLiteralEntry(CfscriptParser.ObjectLiteralEntryContext ctx) {
        String varName = ctx.Identifier().getText();
        checkVarName(varName);
    }

    @Override
    public void enterFunctionDeclaration(CfscriptParser.FunctionDeclarationContext ctx) {
        scopes.push(new Scope(scopes.peek()));
    }

    @Override
    public void exitFunctionDeclaration(CfscriptParser.FunctionDeclarationContext ctx) {
        scopes.pop();        
    }

    private void checkVarName(String varName) {
        Scope scope = scopes.peek();
        if(scope.inScope(varName)) {
            System.out.println("OK   : " + varName);
        }
        else {
            System.out.println("Oops : " + varName);
        }
    }
}

A Scope object could be as simple as:

Scope.java

class Scope extends HashSet<String> {

    final Scope parent;

    public Scope(Scope parent) {
        this.parent = parent;
    }

    boolean inScope(String varName) {
        if(super.contains(varName)) {
            return true;
        }
        return parent == null ? false : parent.inScope(varName);
    }
}

Now, to test this all, here's a small main class:

Main.java

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class Main {

    public static void main(String[] args) throws Exception {

        CfscriptLexer lexer = new CfscriptLexer(new ANTLRFileStream("Test.cfc"));
        CfscriptParser parser = new CfscriptParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.component();
        ParseTreeWalker.DEFAULT.walk(new VarListener(), tree);
    }
}

If you run this Main class, the following will be printed:

Oops : testing
Oops : testingagain
OK   : test
Oops : mystuff
Oops : interior
Oops : third
Oops : other
Oops : something

Without a doubt, this is not exactly what you want and I probably goofed up some scoping rules of Coldfusion. But I think this will give you some insight in how to solve your problem properly. I think the code is pretty self explanatory, but if this is not the case, don't hesitate to ask for clarification.

HTH