Problem with whitespace in a RegEx with capture groups

Dov · Mar 18, 2011 · Viewed 11k times

I've got a regular expression that I'm trying to match against the following types of data, with each token separated by an unknown number of spaces.

Update: "Text" can be almost any character, which is why I had .* initially. Importantly, it can also include spaces.

  1. Text
  2. Text 01
  3. Text 01 of 03
  4. Text 01 (of 03)
  5. Text 01-03

I'd like to capture "Text", "01", and "03" as separate groups, and all except "Text" are optional. The best I've been able to do so far is:

\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)

This matches #3-#5, and puts them in the proper capture groups. I can't figure out, though, why when I add an additional ? to the end to make the part of the expression after 01 optional, my capture groups get all funky.

\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?

The RegEx above matches #2-#5, but the capture groups are correct only for #2 and #5.

This seems like a straightforward regular expression, so I don't know why I'm having so much trouble with it.

This is a link to an online RegEx evaluator I'm using to help me debug this: http://regexr.com?2tb64. The link already has the first RegEx and the test data filled in.

Answer

ridgerunner · Mar 19, 2011

You didn't say which regex tool you are using, so I am assuming the lowest common denominator, i.e. JavaScript. Here is one that works:

var re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

To make this work in your Regexr tool, be sure to turn on the "multi-line option".
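
As a quick check, here is a minimal sketch that runs the pattern above against the five sample inputs from the question, one string at a time (so the multi-line option is not needed here). The helper name `parseLine` and the property names `text`, `first`, and `second` are made up for illustration; they are not part of the question or of any library.

```javascript
// The pattern from above, unchanged.
const re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

// Hypothetical helper: returns the three captures, or null when nothing matches.
function parseLine(line) {
  const m = re.exec(line);
  if (m === null) return null;
  // Optional groups that did not take part in the match come back as undefined.
  return { text: m[1], first: m[2], second: m[3] };
}

// The five input shapes listed in the question.
const samples = [
  "Text",
  "Text 01",
  "Text 01 of 03",
  "Text 01 (of 03)",
  "Text 01-03",
];

for (const s of samples) {
  console.log(JSON.stringify(s), "->", JSON.stringify(parseLine(s)));
}
```

Because `(.+?)` is lazy, "Text" may itself contain spaces: the engine only extends it as far as needed for the rest of the pattern to reach the end of the string.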

Here is the same thing in PHP syntax (with lots of juicy comments!):

$re = '/ # Always write non-trivial regex in free-space mode!
    ^                  # Anchor to start of string.
    \s*                # Optional leading whitespace is ok.
    (.+?)              # Text can be pretty much anything.
    (?:                # Group to allow applying ? quantifier
      \s+              # WS separates "Text" from first number.
      (\d+)            # First number.
      (?:              # Group to allow applying ? quantifier
        (?:            # Second number prefix alternatives
          \s+\(?of\s+  # Either " of 03" or " (of 03)",
        | -            # or just a dash  for "-03" case.
        )              # End second number prefix alternatives
        (\d+)          # Second number
        \)?            # Match ")" for " (of 03)" case.
      )?               # Second number is optional.
    )?                 # First number is optional.
    $                  # Anchor to end of string.
    /ix';
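
As for why the second pattern in the question misbehaves, the likely cause is the greedy `(.*)` combined with the missing end anchor: once the trailing group is optional, the engine can stop at the first backtracking position where `\s+(\d+)` fits, which is the last number in the string rather than the first. The sketch below compares the two patterns in JavaScript; the helper name `groups` is hypothetical.

```javascript
// The second pattern from the question (greedy .* and no anchors).
const greedy = /\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?/;
// The pattern from this answer (lazy .+? plus ^ and $ anchors).
const lazy = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

// Hypothetical helper: capture groups 1..3 as an array, or null on no match.
function groups(re, s) {
  const m = re.exec(s);
  return m === null ? null : [m[1], m[2], m[3]];
}

const input = "Text 01 of 03";
console.log(groups(greedy, input)); // "Text 01 of" ends up in group 1
console.log(groups(lazy, input));   // "Text", "01", "03"
```

With the greedy pattern, `(.*)` first swallows the whole string and gives characters back only until `\s+(\d+)` can match " 03"; the optional tail then matches nothing, so the third group stays empty. Input #5 escapes this because "01-03" contains no whitespace for `\s+` to land on later in the string.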