How to write a Parser in C#?

ApprenticeHacker · Sep 11, 2011 · Viewed 77.9k times

How do I go about writing a parser (recursive descent?) in C#? For now I just want a simple parser that parses arithmetic expressions (and reads variables?), though later I intend to write XML and HTML parsers for learning purposes. I am doing this because parsers are useful in a wide range of areas: web development, programming language interpreters, in-house tools, game engines, map and tile editors, etc. So what is the basic theory of writing parsers, and how do I implement one in C#? Is C# the right language for parsers? (I once wrote a simple arithmetic parser in C++ and it was efficient. Will JIT compilation prove equally good?) Any helpful resources and articles would be appreciated, and best of all, code examples (or links to code examples).

Note: Out of curiosity, has anyone answering this question ever implemented a parser in C#?

Answer

Jonathan Dickinson · Sep 11, 2011

I have implemented several parsers in C#, both hand-written and tool-generated.

A very good introductory tutorial on parsing in general is Let's Build a Compiler. It demonstrates how to build a recursive descent parser, and any competent developer can easily translate the concepts from the author's language (I think it was Pascal) to C#. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.
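To give a feel for what recursive descent looks like in C#, here is a minimal sketch for the arithmetic case from the question. It is not taken from the tutorial: the class name `ArithmeticParser`, the grammar in the comment, and the choice to evaluate while parsing (rather than build a tree) are all assumptions made for illustration. The idea is simply one method per grammar rule, each calling the rules below it.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch, one method per grammar rule:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | IDENTIFIER | '(' expr ')' | '-' factor
public sealed class ArithmeticParser
{
    private readonly string _text;
    private readonly IDictionary<string, double> _variables;
    private int _pos;

    private ArithmeticParser(string text, IDictionary<string, double> variables)
    {
        _text = text;
        _variables = variables;
    }

    public static double Evaluate(string text)
    {
        return Evaluate(text, new Dictionary<string, double>());
    }

    public static double Evaluate(string text, IDictionary<string, double> variables)
    {
        var parser = new ArithmeticParser(text, variables);
        double value = parser.ParseExpression();
        parser.SkipWhitespace();
        if (parser._pos < text.Length)
            throw new FormatException($"Unexpected '{text[parser._pos]}' at position {parser._pos}");
        return value;
    }

    private double ParseExpression()
    {
        double value = ParseTerm();
        while (true)
        {
            if (Match('+')) value += ParseTerm();
            else if (Match('-')) value -= ParseTerm();
            else return value;
        }
    }

    private double ParseTerm()
    {
        double value = ParseFactor();
        while (true)
        {
            if (Match('*')) value *= ParseFactor();
            else if (Match('/')) value /= ParseFactor();
            else return value;
        }
    }

    private double ParseFactor()
    {
        if (Match('-')) return -ParseFactor();          // unary minus
        if (Match('('))
        {
            double inner = ParseExpression();           // recursion happens here
            if (!Match(')')) throw new FormatException("Expected ')'");
            return inner;
        }

        int start = _pos;                               // Match already skipped whitespace
        if (_pos < _text.Length && char.IsLetter(_text[_pos]))
        {
            while (_pos < _text.Length && char.IsLetterOrDigit(_text[_pos])) _pos++;
            string name = _text.Substring(start, _pos - start);
            if (!_variables.TryGetValue(name, out double variableValue))
                throw new KeyNotFoundException($"Unknown variable '{name}'");
            return variableValue;
        }

        while (_pos < _text.Length && (IsAsciiDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
        if (start == _pos) throw new FormatException($"Expected a number at position {start}");
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    private static bool IsAsciiDigit(char c) { return c >= '0' && c <= '9'; }

    // Consumes the expected character (after optional whitespace) if it is next.
    private bool Match(char expected)
    {
        SkipWhitespace();
        if (_pos < _text.Length && _text[_pos] == expected) { _pos++; return true; }
        return false;
    }

    private void SkipWhitespace()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }
}
```

With this sketch, `ArithmeticParser.Evaluate("1 + 2 * 3")` returns 7, and variables are supplied through the dictionary overload. A real parser would normally separate scanning from parsing and produce a syntax tree instead of a number.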

If you are determined to write a classical recursive descent parser, you should look into tools that generate the code for you (TinyPG, Coco/R, Irony). Keep in mind that there are now other ways to write parsers that usually perform better and have simpler definitions (e.g. TDOP parsing or monadic parsing).
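For comparison, TDOP (top-down operator precedence, also known as Pratt parsing) replaces the one-method-per-precedence-level structure with a table of binding powers and a single loop. The following is only a rough sketch under my own assumptions: the class name `PrattParser`, the specific binding-power numbers, and the decision to evaluate directly are all illustrative, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch of top-down operator precedence (Pratt) parsing.
public sealed class PrattParser
{
    // Binding power per infix operator; a higher number binds tighter.
    private static readonly Dictionary<char, int> BindingPower = new Dictionary<char, int>
    {
        ['+'] = 10, ['-'] = 10, ['*'] = 20, ['/'] = 20, ['^'] = 30
    };

    private readonly string _text;
    private int _pos;

    private PrattParser(string text) { _text = text; }

    public static double Evaluate(string text)
    {
        var parser = new PrattParser(text);
        double value = parser.ParseExpression(0);
        if (parser.Peek() != '\0')
            throw new FormatException($"Unexpected '{parser.Peek()}' at position {parser._pos}");
        return value;
    }

    // Core loop: parse a prefix, then keep absorbing infix operators
    // for as long as they bind more tightly than minPower.
    private double ParseExpression(int minPower)
    {
        double left = ParsePrefix();
        while (true)
        {
            char op = Peek();
            if (!BindingPower.TryGetValue(op, out int power) || power <= minPower)
                return left;
            _pos++; // consume the operator
            // '^' is right-associative: recursing with power - 1 lets a
            // following '^' bind to the right-hand side first.
            double right = ParseExpression(op == '^' ? power - 1 : power);
            left = Apply(op, left, right);
        }
    }

    private double ParsePrefix()
    {
        char c = Peek();
        if (c == '-')
        {
            _pos++;
            return -ParseExpression(25); // tighter than '*', looser than '^'
        }
        if (c == '(')
        {
            _pos++;
            double inner = ParseExpression(0);
            if (Peek() != ')') throw new FormatException("Expected ')'");
            _pos++;
            return inner;
        }
        int start = _pos;
        while (_pos < _text.Length &&
               ((_text[_pos] >= '0' && _text[_pos] <= '9') || _text[_pos] == '.')) _pos++;
        if (start == _pos) throw new FormatException($"Expected a number at position {start}");
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    private static double Apply(char op, double left, double right)
    {
        switch (op)
        {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            case '/': return left / right;
            case '^': return Math.Pow(left, right);
            default: throw new FormatException($"Unknown operator '{op}'");
        }
    }

    // Skips whitespace and returns the next character, or '\0' at end of input.
    private char Peek()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
        return _pos < _text.Length ? _text[_pos] : '\0';
    }
}
```

Adding an operator in this style means adding one table entry and one `Apply` case, which is the "easier definitions" part. Whether it performs better than a recursive descent parser depends on the grammar and the implementation.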

On whether C# is up to the task: C# has some of the best text libraries out there. A lot of parsers today (in other languages) carry an obscene amount of code just to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious; however, you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though it's written in F#), and its performance is just shy of Google V8.

Side note: Markup parsers are completely different beasts compared to language parsers. In the majority of cases they are written by hand, and at the scanner/parser level they are very simple. They are not usually recursive descent, and especially in the case of XML it is better not to write a recursive descent parser (to avoid stack overflows on deeply nested documents, and because a 'flat' parser can be used in SAX/push mode).
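To illustrate what 'flat' means here, below is a rough sketch of a non-recursive scanner that pushes events to callbacks, SAX-style. It assumes a drastically simplified markup (only `<name>`, `</name>`, and text; no attributes, comments, entities, or self-closing tags), and the name `FlatMarkupScanner` and its callback parameters are hypothetical. Nesting is tracked on an explicit `Stack<string>` rather than the call stack, so depth is limited by memory, not by recursion.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a flat, push-mode scanner for a toy markup
// consisting only of <name>, </name>, and text.
public static class FlatMarkupScanner
{
    public static void Parse(
        string markup,
        Action<string> onStart,
        Action<string> onEnd,
        Action<string> onText)
    {
        var open = new Stack<string>(); // explicit stack replaces recursion
        int pos = 0;

        while (pos < markup.Length)
        {
            if (markup[pos] == '<')
            {
                int close = markup.IndexOf('>', pos);
                if (close < 0) throw new FormatException("Unterminated tag");
                string tag = markup.Substring(pos + 1, close - pos - 1);

                if (tag.StartsWith("/", StringComparison.Ordinal))
                {
                    // Closing tag: it must match the most recently opened element.
                    string name = tag.Substring(1);
                    if (open.Count == 0 || open.Pop() != name)
                        throw new FormatException($"Unexpected </{name}>");
                    onEnd(name);
                }
                else
                {
                    open.Push(tag);
                    onStart(tag);
                }
                pos = close + 1;
            }
            else
            {
                // Text runs until the next '<' or the end of input.
                int next = markup.IndexOf('<', pos);
                if (next < 0) next = markup.Length;
                onText(markup.Substring(pos, next - pos));
                pos = next;
            }
        }

        if (open.Count > 0) throw new FormatException($"Unclosed <{open.Peek()}>");
    }
}
```

A caller passes three delegates, for example `FlatMarkupScanner.Parse(text, n => Console.WriteLine("start " + n), n => Console.WriteLine("end " + n), t => Console.WriteLine("text " + t))`, and receives events in document order without the scanner ever building a tree.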