Regex in Python

Andrea Ambu · Jun 15, 2009

Goal: Given a number (it may be very long, and it is greater than 0), I'd like to get its five least significant digits, ignoring any trailing zeros.

I tried to solve this with a regex. With help from RegexBuddy, I came up with this one:

[\d]+([\d]{0,4}+[1-9])0*

But Python can't compile that:

>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.5/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

The problem is the "+" after "{0,4}"; it doesn't seem to work in Python (not even in 2.6).

How can I write a working regex?

PS: I know you can repeatedly divide by 10 to strip the trailing zeros and then take the remainder n % 100000, but this question is about regex.

Answer

Blixt · Jun 15, 2009

That regular expression is more complicated than it needs to be. Try this:

>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")
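
As a minimal sketch of how this pattern might be used (the helper name `last_five_digits` is hypothetical, not something from the question): because the pattern is anchored only at the end, it has to be applied with `search`; `match` would require the match to begin at the start of the string.

```python
import re

# Up to four digits, then a non-zero digit, then any trailing zeros
# up to the end of the string.
TAIL = re.compile(r"(\d{0,4}[1-9])0*$")

def last_five_digits(number):
    """Return up to five least significant digits, ignoring trailing zeros.

    Hypothetical helper for illustration. Returns None when the input
    does not end in a non-zero digit followed only by zeros.
    """
    m = TAIL.search(str(number))
    return m.group(1) if m else None

print(last_five_digits(1234567000))    # 34567
print(last_five_digits("abc0123450"))  # 12345 (the prefix is not validated)
print(last_five_digits(1000000))       # 1
```

The result is a string, so leading zeros inside the captured digits are preserved.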

The above regular expression assumes that the input is a valid number (it will also match "abc0123450", for example). If you need to validate that there are no non-digit characters, you may use this:

>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")
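
A brief sketch of the validating variant in use, assuming the input arrives as a string; `validated_tail` is a hypothetical name chosen for this example. Since the pattern starts with `^`, `match` and `search` behave the same here.

```python
import re

# Whole string must be digits: a lazy prefix, the captured digits,
# then only trailing zeros.
STRICT = re.compile(r"^\d*?(\d{0,4}[1-9])0*$")

def validated_tail(text):
    """Return the captured digits, or None if the input is rejected.

    Hypothetical helper for illustration only.
    """
    m = STRICT.match(text)
    return m.group(1) if m else None

print(validated_tail("1234567000"))  # 34567
print(validated_tail("abc0123450"))  # None
print(validated_tail("100230"))      # 10023
```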

Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need to be made possessive (which is what the additional + specifies in engines that support it, although apparently Python does not recognize that).

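Also, in the second regular expression, the leading \d*? is non-greedy, as I believe this will improve the performance and accuracy: a greedy \d* would take as many digits as it can and leave the capture group with only the last non-zero digit. I also made it "zero or more" as I assume that is what you want.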
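
To illustrate the effect of the lazy prefix, here is a small comparison sketch; the names `LAZY` and `GREEDY` are illustrative only. With a greedy prefix, backtracking stops as soon as the group can match at all, which happens when it holds a single digit.

```python
import re

LAZY = re.compile(r"^\d*?(\d{0,4}[1-9])0*$")    # prefix takes as little as possible
GREEDY = re.compile(r"^\d*(\d{0,4}[1-9])0*$")   # prefix takes as much as possible

number = "1234567000"
lazy_result = LAZY.match(number).group(1)
greedy_result = GREEDY.match(number).group(1)

print(lazy_result)    # 34567
print(greedy_result)  # 7
```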

I also added anchors, as this ensures that your regular expression won't match anything in the middle of a string. If matching in the middle of a string is what you want, though (maybe you're scanning a long text?), remove the anchors.
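
When scanning longer text, note that without `$` the match can end before the number does, so the group may capture the first five digits it finds instead of the last five. One possible adaptation, which is an assumption and not part of the patterns above, is to use word boundaries (`\b`) in place of the anchors:

```python
import re

# Assumed adaptation: \b on both sides instead of ^ and $, so each
# whole run of digits in the text is treated as one number.
IN_TEXT = re.compile(r"\b\d*?(\d{0,4}[1-9])0*\b")

text = "ids 1234567000 and 50, then 100230"
found = IN_TEXT.findall(text)
print(found)  # ['34567', '5', '10023']
```

Keep in mind that `\b` treats letters and underscores as word characters, so digits glued to a word (as in "abc123") would not be picked up by this sketch.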