I'm trying to learn ANTLR and at the same time use it for a current project.
I've gotten to the point where I can run the lexer on a chunk of code and output it to a CommonTokenStream. This is working fine, and I've verified that the source text is being broken up into the appropriate tokens.
Now, I would like to be able to modify the text of certain tokens in this stream, and display the now modified source code.
For example I've tried:
import org.antlr.runtime.*;
import java.util.*;
public class LexerTest
{
public static final int IDENTIFIER_TYPE = 4;
public static void main(String[] args)
{
String input = "public static void main(String[] args) { int myVar = 0; }";
CharStream cs = new ANTLRStringStream(input);
JavaLexer lexer = new JavaLexer(cs);
CommonTokenStream tokens = new CommonTokenStream();
tokens.setTokenSource(lexer);
int size = tokens.size();
for(int i = 0; i < size; i++)
{
Token token = (Token) tokens.get(i);
if(token.getType() == IDENTIFIER_TYPE)
{
token.setText("V");
}
}
System.out.println(tokens.toString());
}
}
I'm trying to set all Identifier token's text to the string literal "V".
Why are my changes to the token's text not reflected when I call tokens.toString()?
How am I suppose to know the various Token Type IDs? I walked through with my debugger and saw that the ID for the IDENTIFIER tokens was "4" (hence my constant at the top). But how would I have known that otherwise? Is there some other way of mapping token type ids to the token name?
EDIT:
One thing that is important to me is I wish for the tokens to have their original start and end character positions. That is, I don't want them to reflect their new positions with the variable names changed to "V". This is so I know where the tokens were in the original source text.
ANTLR has a way to do this in it's grammar file.
Let's say you're parsing a string consisting of numbers and strings delimited by comma's. A grammar would look like this:
grammar Foo;
parse
: value ( ',' value )* EOF
;
value
: Number
| String
;
String
: '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
;
Number
: '0'..'9'+
;
Space
: ( ' ' | '\t' ) {skip();}
;
This should all look familiar to you. Let's say you want to wrap square brackets around all integer values. Here's how to do that:
grammar Foo;
options {output=template; rewrite=true;}
parse
: value ( ',' value )* EOF
;
value
: n=Number -> template(num={$n.text}) "[<num>]"
| String
;
String
: '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
;
Number
: '0'..'9'+
;
Space
: ( ' ' | '\t' ) {skip();}
;
As you see, I've added some options
at the top, and added a rewrite rule (everything after the ->
) after the Number
in the value
parser rule.
Now to test it all, compile and run this class:
import org.antlr.runtime.*;
public class FooTest {
public static void main(String[] args) throws Exception {
String text = "12, \"34\", 56, \"a\\\"b\", 78";
System.out.println("parsing: "+text);
ANTLRStringStream in = new ANTLRStringStream(text);
FooLexer lexer = new FooLexer(in);
CommonTokenStream tokens = new TokenRewriteStream(lexer); // Note: a TokenRewriteStream!
FooParser parser = new FooParser(tokens);
parser.parse();
System.out.println("tokens: "+tokens.toString());
}
}
which produces:
parsing: 12, "34", 56, "a\"b", 78
tokens: [12],"34",[56],"a\"b",[78]