How can I split a string into tokens?

Martin Thetford · Aug 19, 2013

If I have a string

'x+13.5*10x-4e1'

how can I split it into the following list of tokens?

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

Currently I'm using the shlex module:

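import shlex

def tokenize(expression):
    # Named `expression` rather than `str`, which would shadow the built-in str()
    lexer = shlex.shlex(expression)
    tokenList = []
    for token in lexer:
        tokenList.append(str(token))
    return tokenList

tokenize('x+13.5*10x-4e1')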

But this returns:

['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']

So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers and then splitting them somehow, but I'm not sure how to do this, or how to put the pieces back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.

In an ideal world, e and E would not be recognised as letters in the same way, so

'-4e1'

would become

['-', '4e1']

but

'-4x1'

would become

['-', '4', 'x', '1']

Can anybody help?

Answer

Peter Varo · Aug 19, 2013

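Use the split() function from the re (regular expression) module to split at

  • '\d+' -- runs of digits (number characters) and
  • '\W+' -- runs of non-word characters.

Because the pattern is wrapped in a capturing group, re.split() keeps the matched separators in the result; the list comprehension then drops the empty strings that appear between adjacent matches.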

CODE:

import re

print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
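
To see why the parentheses in the pattern matter, here is a small sketch comparing the same split with and without the capturing group. The variable names are only illustrative; the input is the string from the question.

```python
import re

expr = 'x+13.5*10x-4e1'

# Without a capturing group the matched separators are consumed and lost,
# so only the text between matches survives.
without_group = [t for t in re.split(r'\d+|\W+', expr) if t]

# With a capturing group re.split() also returns the matched separators,
# so every digit run and operator shows up as its own token.
with_group = [t for t in re.split(r'(\d+|\W+)', expr) if t]

print(without_group)
print(with_group)
```

Only the second form gives back the digits and operators, which is why the answer wraps the whole alternation in one group.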

If you don't want to split at the dot (so that a floating-point number in the expression stays one token), use this instead:

  • '[\d.]+' -- runs of digit or dot characters (although this also accepts malformed numbers such as 13.5.5)

CODE:

print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
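
Neither pattern above treats e or E as part of a number, which the question lists as the ideal behaviour. One possible way to get there is to match tokens with re.findall() instead of splitting. The following is only a sketch under some assumptions: the function name split_tokens is made up for illustration, a number is taken to be digits with an optional fraction and an optional exponent, and every other non-space character becomes a token of its own.

```python
import re

# Assumed token grammar (not from the original answer):
#   number     -> digits, optional ".digits", optional e/E exponent
#   identifier -> a run of letters or underscores
#   other      -> any single non-whitespace character (operators etc.)
TOKEN = re.compile(r'''
    \d+(?:\.\d+)?(?:[eE][+-]?\d+)?
  | [A-Za-z_]+
  | \S
''', re.VERBOSE)

def split_tokens(expression):
    # All groups are non-capturing, so findall() returns the full matches
    # as a flat list, in the order they occur in the string.
    return TOKEN.findall(expression)

print(split_tokens('x+13.5*10x-4e1'))
print(split_tokens('-4e1'))
print(split_tokens('-4x1'))
```

With this pattern 13.5.5 is no longer accepted as one number; it comes out as '13.5', '.', '5'. Note that a sign directly after the exponent letter is absorbed into the number, so an input like 2e+3 is read as a single token rather than as 2e plus 3.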