Generate an AST in C++

Francis picture Francis · Feb 15, 2015 · Viewed 12.2k times · Source

I'm making an interpreter in C++, so far I've got my lexer to generate tokens. The problem is I'm not sure how to generate an "walk" a parse tree.

I was thinking of making my parse tree using an array of arrays, but I'm not sure how to actually insert the tokens into the parse tree in the correct order.

I'm not sure whether or not to go top-down, left-right or bottom-up, right-left.

Can anyone provide me with a simple LALR(1) algorithm?

Answer

Cameron picture Cameron · Feb 15, 2015

I'm going to go against conventional wisdom here and say that you should build your AST manually with natural language-specific data-structures.

A generic "AST data structure" will be too general to work with comfortably -- the code that consumes the AST to do anything useful with it will be obscured by the workarounds to access the data it wants. If you go this route (or use a parser generator), you'll likely end up building a translation layer to go from the generic structure to an AST that actually makes sense for your language. Why not avoid the middleman and build the final data structure directly?

I suggest writing a first syntactic pass that consumes the tokens it needs for each possible construct (probably peeking ahead one token as needed). This syntactic pass would construct the AST out of linked instances of data structures that represent each possible construct in your language. For example, if your program can consist of a series of statements, where each statement can be either a function declaration or a function call, you could create the following data structures:

AST (struct)
    -> std::vector<Statement*> statements

StatementKind (enum class)
    -> Function
    -> Call

Statement (struct)
    -> StatementKind kind

Function : Statement (struct)
    -> std::string name
    -> std::vector<Statement*> body

Call : Statement (struct)
    -> std::string name
    -> Function* called

Then the syntactic pass to build the initial AST might look like:

auto ast = parseAST();

where parseAST calls parseStatement repeatedly, which consumes and/or peeks at tokens to determine whether the statement is a function definition or a function call, then calls parseFunction or parseCall appropriately. This is called a hand-written recursive descent parser, and is in my experience by far the simplest and best type of parser you can write. Yes there is boilerplate to write, but it's also very clear and flexible code.

The syntactic phase generates syntax error messages as it goes, attempting to recover as best as it can when an error is found (this is one argument for semicolon-delimited languages -- the compilers and tools can often recover from errors much more easily, since it often suffices to skip to the next semicolon when an error is encountered before trying to parse the next construct).

If a function is allowed to be called before it is defined, this is simple to handle too -- just parse the call part without checking if a function with that name exists at the time you first parse it, then correlate it later.

A second, semantic, pass through the AST would annotate it with any missing cross-referenced data (e.g. resolving function calls' function names with those functions' definitions, or generating an error if the name is not found). Once that's done, you have a full AST that's guaranteed to represent a valid program (since you did all the semantic error checking in this pass), and can have optimizations applied to it, followed by the code generation phase (and more optimizations along the way).

Note that I completely left out any mention of LL or LR parsers, etc. The theoretical notation and categorization of your language is important (as it can prove it's non-ambiguous, for example), but from an implementation point of view, it's far more important to have clean, easily understood and above all easily modifiable code than to adhere to a specific parsing mechanism. Examples of other compilers that use hand-written parsers include GCC, Clang, and Google's V8, among others.

In terms of efficiency, the AST is generally iterated over several times after it's constructed, and needs to stick around in memory until late in the process (possibly right until the end, depending on how you do your code generation), and so it's best to use an object pool to allocate the records for your AST than to dynamically allocate everything separately on the heap. Finally, in place of strings, it's generally better to use an offset-and-length into the original source code.