Pythonic way to implement a tokenizer

Peter · Mar 27, 2009 · Viewed 15.2k times

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I've implemented tokenizers before in C and Java, so I'm fine with the theory; I'd just like to make sure I'm following Pythonic style and best practices.

Listing Token Types:

In Java, for example, I would have a list of fields like so:

public static final int TOKEN_INTEGER = 0;

But there's no way (as far as I know) to declare a true constant in Python. I could just use ordinary variables instead, but that doesn't strike me as a great solution, since they could be reassigned.
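
For concreteness, a minimal sketch of the two usual approaches: module-level UPPER_CASE names, which are constants by convention only, and enum.Enum (added in Python 3.4), which does prevent reassignment. The TokenType class below is illustrative, not part of the question:

from enum import Enum

# Constant by convention only -- nothing stops reassignment.
TOKEN_INTEGER = 0
TOKEN_STRING = 1

# Enum members cannot be reassigned:
# `TokenType.INTEGER = 5` raises AttributeError.
class TokenType(Enum):
    INTEGER = 0
    STRING = 1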

Returning Tokens From The Tokenizer:

Is there a better alternative to simply returning a list of tuples, e.g.

[(TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?
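
One common alternative, sketched here for illustration (the Token name and the toy lexing logic are mine, not part of the question): a generator that yields named tuples, so callers can write tok.type instead of tok[0], and tokens are produced lazily:

from collections import namedtuple

Token = namedtuple("Token", ["type", "value"])

def tokenize(text):
    # Toy lexer: splits on whitespace only, for illustration.
    for word in text.split():
        if word.isdigit():
            yield Token("INTEGER", int(word))
        else:
            yield Token("STRING", word)

# list(tokenize("17 Sixteen")) == [Token("INTEGER", 17), Token("STRING", "Sixteen")]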

Cheers,

Pete

Answer

AKX · Mar 29, 2009

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re

scanner = re.Scanner([
    (r"[0-9]+",  lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+",   lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+",     None),  # None means: skip the matched text.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]
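
One detail worth noting (my addition, not from the original answer): scan() returns the unconsumed tail of the input as its second element, so a character that no pattern matches can be detected and reported:

# "&" matches no pattern, so scanning stops there.
results, remainder = scanner.scan("45 pigeons & 23 cows")
if remainder:
    raise ValueError(f"Tokenizing failed at: {remainder!r}")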

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.