I've got a regular expression that I'm trying to match against the following types of data, with each token separated by an unknown number of spaces.
Update: "Text" can be almost any character, which is why I had .* initially. Importantly, it can also include spaces.
I'd like to capture "Text", "01", and "03" as separate groups, and all except "Text" are optional. The best I've been able to do so far is:
\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)
This matches #3-#5 and puts them in the proper capture groups. I can't figure out, though, why my capture groups get all funky when I add an additional ? to the end to make the part of the expression after 01 optional.
\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?
The RegEx above matches #2-#5, but the capture groups are correct only for #2 and #5.
This seems like a straightforward regular expression, so I don't know why I'm having so much trouble with it.
This is a link to an online RegEx evaluator I'm using to help me debug this: http://regexr.com?2tb64. The link already has the first RegEx and the test data filled in.
You didn't say which regex tool you are using, so I am assuming the least common denominator, i.e. JavaScript. Here is one that works:
var re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;
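To make this work in your Regexr tool, be sure to turn on the "multiline" option, so that ^ and $ match at the start and end of each line of the test data rather than only at the start and end of the whole input.
Here is the same thing in PHP syntax (with lots of juicy comments!):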
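For a quick check outside Regexr, here is a minimal sketch that runs the expression above over a few lines. The sample strings are my own guesses at what the test data looks like (the real data is behind the link in the question), and parseLine is just a hypothetical helper name.

```javascript
// The regex from the answer above, unchanged.
const re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

// Hypothetical sample lines (assumed, not the original test data).
const samples = [
  "Text",
  "Text 01",
  "Text 01 of 03",
  "Text 01 (of 03)",
  "Text 01-03",
  "Some longer text   07",
];

// Hypothetical helper: returns the three captures, with null for
// any optional group that did not participate in the match.
function parseLine(line) {
  const m = re.exec(line);
  if (m === null) return null;
  return {
    text: m[1],
    first: m[2] === undefined ? null : m[2],
    second: m[3] === undefined ? null : m[3],
  };
}

for (const s of samples) {
  console.log(JSON.stringify(s), parseLine(s));
}
```

Each line is tested on its own here, so the multiline flag is not needed; it only matters when the whole block of test data is matched as one string.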
$re = '/ # Always write non-trivial regex in free-space mode!
    ^                    # Anchor to start of string.
    \s*                  # Optional leading whitespace is ok.
    (.+?)                # Text can be pretty much anything.
    (?:                  # Group to allow applying ? quantifier.
      \s+                # WS separates "Text" from first number.
      (\d+)              # First number.
      (?:                # Group to allow applying ? quantifier.
        (?:              # Second number prefix alternatives.
          \s+\(?of\s+    # Either " of 03" or " (of 03)",
        | -              # or just a dash for "-03" case.
        )                # End second number prefix alternatives.
        (\d+)            # Second number.
        \)?              # Match ")" for " (of 03)" case.
      )?                 # Second number is optional.
    )?                   # First number is optional.
    $                    # Anchor to end of string.
    /ix';
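As for why the second attempt in the question misgroups things, my reading is that the greedy (.*) is the culprit: once the tail is optional, (.*) grabs as much as it can and only backs off until the nearest whitespace-plus-digits from the end satisfies \s+(\d+). The lazy (.+?) together with the ^ and $ anchors avoids that. A small sketch comparing the two, where the sample line and the captures helper are my own illustrative assumptions:

```javascript
// Second attempt from the question (greedy, unanchored).
const greedy = /\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?/;
// Regex from this answer (lazy, anchored).
const lazy = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

// Hypothetical helper: returns just the capture groups, or null on no match.
function captures(re, line) {
  const m = re.exec(line);
  return m ? m.slice(1) : null;
}

const line = "Text 01 of 03"; // assumed sample line

// Greedy: group 1 is "Text 01 of", group 2 is "03", group 3 is unset.
console.log(captures(greedy, line));

// Lazy and anchored: group 1 is "Text", group 2 is "01", group 3 is "03".
console.log(captures(lazy, line));
```

The dash form ("Text 01-03") still groups correctly with the greedy version, because there is no whitespace before the second number for (.*) to stop at, which would be consistent with only some of the cases in the question coming out right.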