(You can skip to "What if..." if you get bored with intros.)
This question is not specific to VBScript (I just happened to use it in this case): I want to find a solution for general regular-expression usage (editors included).
This started when I wanted to create an adaptation of Example 4 where 3 capture groups are used to split data across 3 cells in MS Excel. I needed to capture one entire pattern and then, within it, capture 3 other patterns. However, in the same expression, I also needed to capture another kind of pattern and again capture 3 other patterns within it (yeah I know... but before pointing the nutjob finger, please finish reading).
I first thought of named capturing groups, but then I realized that I should not «mix named and numbered capturing groups» since it «is not recommended because flavors are inconsistent in how the groups are numbered».
Then I looked into VBScript SubMatches and «non-capturing» groups and I got a working solution for a specific case:
' regEx is assumed to be a VBScript.RegExp object created earlier
For Each C In Myrange
    strPattern = "(?:^([0-9]+);([0-9]+);([0-9]+)$|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"
    If strPattern <> "" Then
        strInput = C.Value
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With
        Set rgxMatches = regEx.Execute(strInput)
        ' With no match at all the loop below never runs, so report it here
        If rgxMatches.Count = 0 Then C.Offset(0, 1) = "(Not matched)"
        For Each mtx In rgxMatches
            If mtx.SubMatches(0) <> "" Then
                ' 1st parent pattern matched: use groups 1 to 3
                C.Offset(0, 1) = mtx.SubMatches(0)
                C.Offset(0, 2) = mtx.SubMatches(1)
                C.Offset(0, 3) = mtx.SubMatches(2)
            ElseIf mtx.SubMatches(3) <> "" Then
                ' 2nd parent pattern matched: use groups 4 to 6
                C.Offset(0, 1) = mtx.SubMatches(3)
                C.Offset(0, 2) = mtx.SubMatches(4)
                C.Offset(0, 3) = mtx.SubMatches(5)
            Else
                C.Offset(0, 1) = "(Not matched)"
            End If
        Next
    End If
Next
Here's a demo in Rubular of the regex. For these inputs:
124;12;3
my id1:213 my id2:232 my word:ins4yanrgx
:8587459 :18254182540215 :dcpt
0;1;2
It returns the first 2 cells with numbers and the 3rd with a number or a word. Basically I used a non-capturing group with 2 "parent" patterns ("parents" = broad patterns where I want to detect other sub-patterns). If the 1st parent pattern has a matching sub-pattern (1st capture group) then I place its value and the remaining captured groups of this pattern in the 3 cells. If not, I check if the 4th capture group (belonging to the 2nd parent pattern) was matched and place the remaining sub-patterns in the same 3 cells.
What if...
Instead of having something like this:
(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))
Something like this could be possible:
(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))
Where (#:, instead of creating a non-capturing group, would create a "parent" numbered capture group.
In this way I could do something similar to Example 4:
C.Offset(0, 1) = regEx.Replace(strInput, "#$1")
C.Offset(0, 2) = regEx.Replace(strInput, "#$2")
C.Offset(0, 3) = regEx.Replace(strInput, "#$3")
It would search parent patterns until it finds a match in a child pattern (the first match would be returned and, ideally, wouldn't search the remaining ones).
Is there something like this already? Or am I missing something in regex that already allows me to do this?
Other possible variations:
#2$3 (this would be equivalent to $6 in my example);
(#:^_(?:(#:(\d+):\w+-(\d))|(#:\w+:(\d+)-(\d+)))_$)|(#:^\w+:\s+(#:(\w+);\d-(\d+))$) and fetching ##$1 in patterns like:
_123:smt-4_ (it would match in: 123)
_ott:432-10_ (it would match in: 432)
yant: special;3-45235 (it would match in: special)
Please tell me if you noticed any mistakes or flaws in this logic, I will edit asap.
This usually comes up when mostly the same data is to be captured and the only difference is in its form.
There is a regex construct for that called Branch Reset.
It's offered by most Perl-compatible engines, but not by Java or .NET.
It mostly just saves regex resources and makes it easier to handle matches.
The alternative you mention will not help in any way; it actually just uses
more resources. You still have to see what matched to know where you are.
But you only have to check one group within a cluster to tell which other
groups are valid (this check is unnecessary if you use branch reset).
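To make the "check one group per cluster" point concrete, here is a minimal sketch in Python's standard `re` module, using the two-alternative pattern from the question. The function name `split_cells` is made up for illustration, and the sample strings are the ones from the question.

```python
import re

# The two "parent" alternatives from the question, in one non-capturing group.
PATTERN = re.compile(
    r"(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$)"
)

def split_cells(text):
    """Return the three captured values, or None if nothing matched.

    Hypothetical helper: testing one group per alternative is enough
    to know which cluster of groups is valid.
    """
    m = PATTERN.search(text)
    if m is None:
        return None
    if m.group(1) is not None:      # first alternative matched
        return m.group(1, 2, 3)
    return m.group(4, 5, 6)         # otherwise the second one did

for line in ["124;12;3",
             "my id1:213 my id2:232 my word:ins4yanrgx",
             ":8587459 :18254182540215 :dcpt",
             "0;1;2"]:
    print(split_cells(line))
```

Every extra alternative adds one more branch to that `if` chain, which is exactly the bookkeeping that branch reset removes.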
(below was constructed using RegexFormat 6)
Here is the branch reset version:
# (?|^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever)()())
(?|
     ^
     ( \d+ )                       # (1)
     ;
     ( \d+ )                       # (2)
     ;
     ( \d+ )                       # (3)
     $
  |
     ^ .* :
     ( \d+ )                       # (1)
     \s .* :
     ( \d+ )                       # (2)
     .* :
     ( \w+ )                       # (3)
     $
  |
     what
     ( ever )                      # (1)
     ( )                           # (2)
     ( )                           # (3)
)
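Not every engine has `(?|...)`. Python's standard `re` module, for one, rejects it. As a rough sketch of how the same "always groups 1 to 3" result can be emulated there, the code below keeps the plain non-capturing version and collapses whichever groups took part in the match. The helper name `branch_reset_groups` and the padding width are assumptions for illustration. The trick only works because every group inside a matching alternative always participates, so it is not a general replacement for branch reset.

```python
import re

# Plain alternation: groups are numbered 1..7 across the three branches.
PATTERN = re.compile(
    r"(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))"
)

def branch_reset_groups(text, width=3):
    """Hypothetical helper: return the groups of the branch that matched,
    renumbered from the left and padded with "" up to `width` entries."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Groups from branches that did not match are None; drop them.
    found = [g for g in m.groups() if g is not None]
    found += [""] * (width - len(found))
    return tuple(found)

print(branch_reset_groups("124;12;3"))
print(branch_reset_groups(":8587459 :18254182540215 :dcpt"))
print(branch_reset_groups("whatever"))
```

The padding plays the same role as the two empty `( )` groups in the third branch of the branch reset version: it keeps the shape of the result the same for every alternative.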
Here are your two regexes. Notice that the 'parent' capturing actually increases the number of groups (which slows down the engine):
# (?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))
(?:
     ^
     ( \d+ )                       # (1)
     ;
     ( \d+ )                       # (2)
     ;
     ( \d+ )                       # (3)
     $
  |
     ^ .* :
     ( \d+ )                       # (4)
     \s .* :
     ( \d+ )                       # (5)
     .* :
     ( \w+ )                       # (6)
     $
  |
     what
     ( ever )                      # (7)
)
and
# (#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))
(                             # (1 start)
     \#: ^
     ( \d+ )                       # (2)
     ;
     ( \d+ )                       # (3)
     ;
     ( \d+ )                       # (4)
     $
)                             # (1 end)
|
(                             # (5 start)
     \#: ^ .* :
     ( \d+ )                       # (6)
     \s .* :
     ( \d+ )                       # (7)
     .* :
     ( \w+ )                       # (8)
     $
)                             # (5 end)
|
(                             # (9 start)
     \#:what
     ( ever )                      # (10)
)                             # (9 end)
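The group counts in the two listings can be checked mechanically. This small sketch compiles both patterns with Python's `re` (where `#:` is just literal text, as in the expansion above) and reads the compiled pattern's `groups` attribute. The variable names are illustrative only.

```python
import re

# The non-capturing version and the "(#:" parent version from the question.
flat = re.compile(
    r"(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))"
)
parent = re.compile(
    r"(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))"
)

# `groups` is the number of capturing groups in the compiled pattern.
print(flat.groups)    # 7
print(parent.groups)  # 10
```

The three extra groups are the parents themselves, so wrapping each alternative adds bookkeeping without telling you anything a single child group could not.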