Writing a syntax highlighter

Ian Dallas · May 1, 2009

I am hoping to write my own syntax highlighter as a summer project, but I am not sure how to go about it.

I know that there are a bunch of implementations out there, but I would like to learn about regular expressions and how syntax highlighting works.

How does syntax highlighting work and what are some good references for developing one? Does the syntax highlighter scan each character as it is typed or does it scan the document/text area as a whole after each character is typed?

Any insight would be greatly appreciated.

Thanks.

PS: I am planning on writing it in ActionScript.

Answer

David Anderson · May 1, 2009

Syntax highlighters can work in two very general ways. The first implements a full lexer and parser for the language(s) being highlighted, exactly identifying each token's type (keyword, class name, instance name, variable type, preprocessor directive...). This provides all the information needed to exactly highlight the code according to some specification (keywords in red, class names in blue, what have you).
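To make the first approach concrete, here is a minimal sketch of just the lexer half, written in Python rather than ActionScript. The token names and the tiny C-like language are invented for illustration. A real highlighter of this kind would also need a parser on top, for instance to tell class names from instance names.

```python
import re

# Token rules for a tiny, made-up C-like language (illustrative only).
# Order matters: earlier alternatives win, so COMMENT precedes OP and
# KEYWORD precedes IDENT.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("STRING",  r'"(?:\\.|[^"\\])*"'),
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("KEYWORD", r"\b(?:if|else|while|return|var|function)\b"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[-+*/=<>!&|]+"),
    ("PUNCT",   r"[(){}\[\];,.]"),
    ("SPACE",   r"\s+"),
    ("OTHER",   r"."),  # catch-all so no character is ever skipped
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Return a list of (token_type, text) pairs covering the whole input."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(source)]


if __name__ == "__main__":
    for token_type, text in tokenize('var x = 42; // answer'):
        print(token_type, repr(text))
```

Once every piece of text carries a type, highlighting is just a lookup from token type to colour.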

The second way is something like what Google Code Prettify employs: instead of implementing one lexer/parser per language, it uses a couple of very general parsers that can do a decent job on most syntaxes. This highlighter, for example, can parse and highlight any C-like language reasonably well, because its lexer/parser can identify the general components of those kinds of languages.
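As a rough illustration of the generic approach, the sketch below uses one set of loose rules (comments, strings, numbers, a shared keyword list, capitalised words treated as type names) and wraps each match in an HTML span. This is not Prettify's actual code; the rule set and the class names `com`, `str`, `num`, `kwd` and `typ` are assumptions made for the example.

```python
import html
import re

# One loose rule set meant to be "good enough" for many C-like languages.
GENERIC_RULES = re.compile(
    r"""
    (?P<com>//[^\n]*|/\*.*?\*/)                          # line and block comments
  | (?P<str>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')     # quoted strings
  | (?P<num>\b\d+(?:\.\d+)?\b)                           # integer and decimal literals
  | (?P<kwd>\b(?:if|else|for|while|return|class|void|int|var|function|new)\b)
  | (?P<typ>\b[A-Z]\w*\b)                                # guess: capitalised word is a type
    """,
    re.VERBOSE | re.DOTALL,
)


def highlight(source):
    """Return HTML in which every recognised token is wrapped in a <span>."""
    out = []
    pos = 0
    for m in GENERIC_RULES.finditer(source):
        out.append(html.escape(source[pos:m.start()]))  # unclassified text
        out.append(
            f'<span class="{m.lastgroup}">{html.escape(m.group())}</span>'
        )
        pos = m.end()
    out.append(html.escape(source[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(highlight('String s = "a<b"; // note'))
```

The same function gives plausible output for Java, C, JavaScript or ActionScript input, but it will miscolour anything that breaks its guesses, such as a capitalised variable name.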

This also has the advantage that you don't need to explicitly specify the language, as the engine will determine by itself which of its generic parsers can do the best job. The downside, of course, is that the highlighting is less accurate than when a language-specific parser is used.
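How an engine picks among its generic parsers is not described here. One simple heuristic, offered purely as an assumption and not as Prettify's method, is to score the text against patterns typical of each syntax family and choose the highest score. The family names and hint patterns below are hypothetical.

```python
import re

# Hypothetical syntax families; each regex lists constructs typical of one.
FAMILY_HINTS = {
    "c_like": re.compile(r"//[^\n]*|/\*.*?\*/|[{};]", re.DOTALL),
    "hash_comment": re.compile(
        r"^\s*#[^\n]*|\b(?:def|elif|end)\b|:\s*$", re.MULTILINE
    ),
    "markup": re.compile(r"</?[A-Za-z][^<>]*>|&[a-z]+;"),
}


def guess_family(source):
    """Return the family whose hint patterns cover the most characters."""
    scores = {
        name: sum(len(m.group()) for m in pattern.finditer(source))
        for name, pattern in FAMILY_HINTS.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(guess_family("int main() { return 0; } // done"))
```

A guess like this can be wrong on short or unusual snippets, which is part of why the generic approach is less accurate than a language-specific one.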