Best way to tokenize and parse programming languages in my application

ChrisDiRulli · Feb 12, 2010 · Viewed 9.1k times

I'm working on a tool that will perform some simple transformations on programs (like Extract Method). To do this, I will have to perform the first few steps of compilation (tokenizing, parsing, and possibly building a symbol table). I'm going to start with C and then hopefully extend this to support multiple languages.

My question is, what's the best way to perform these steps that:

1.) Doesn't reinvent the wheel. Clearly I don't want to write Flex/Bison specifications by hand. Do I just grab pre-existing specifications and work from there? Is ANTLR the way to go here?

2.) Is extensible to multiple languages. Obviously lexing/parsing will differ for every language, but I would like a solution that I could easily extend to other languages, or at least a set of technologies that would make this manageable.

BTW, I'm using C to write my applications.

If anyone has any ideas, that would be great! Thanks!

Answer

user177800 · Feb 12, 2010

Hands down, the best way to do any parsing is ANTLR. There are two great books on the subject by its author that are must-haves: The Definitive ANTLR Reference: Building Domain-Specific Languages and Language Implementation Patterns. Both are invaluable resources. ANTLR can generate processing code in lots of different target languages.
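To give a sense of what writing a grammar for ANTLR involves, here is a hypothetical fragment for simple arithmetic expressions (ANTLR v3-era syntax; rule names are my own). A full grammar for C is of course much larger, and community-contributed grammars for C and other languages can be found in the ANTLR grammar repositories rather than written from scratch:

```
grammar Expr;

expr    : term (('+' | '-') term)* ;
term    : factor (('*' | '/') factor)* ;
factor  : INT | '(' expr ')' ;

INT     : ('0'..'9')+ ;
WS      : (' ' | '\t' | '\n')+ { $channel = HIDDEN; } ;
```

From a grammar like this, ANTLR generates both the lexer and the recursive-descent parser, so extending the tool to another language largely means swapping in another grammar.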