How do you capture a group with regex?

Sylvain · Apr 5, 2010 · Viewed 18.5k times

I'm trying to extract a substring from a string using a regex. I'm using the POSIX regex functions (regcomp, regexec, ...), but I can't manage to capture a group.

For instance, let the pattern be something as simple as "MAIL FROM:<(.*)>"
(with REG_EXTENDED cflags)

I want to capture everything between '<' and '>'.

My problem is that regmatch_t gives me the boundaries of the whole pattern (MAIL FROM:<...>) instead of just what's between the parentheses.

What am I missing?

Thanks in advance,

edit: some code

#include <regex.h>
#include <stdio.h>

#define SENDER_REGEX "MAIL FROM:<(.*)>"

int main(int ac, char **av)
{
  regex_t regex;
  int status;
  regmatch_t pmatch[1];

  if (ac < 2)
    return (1);
  if (regcomp(&regex, SENDER_REGEX, REG_ICASE|REG_EXTENDED) != 0)
    {
      printf("regcomp error\n");
      return (1);
    }
  status = regexec(&regex, av[1], 1, pmatch, 0);
  regfree(&regex);
  if (!status)
      /* regoff_t is a signed type of unspecified width, so cast for %d */
      printf(  "matched from %d (%c) to %d (%c)\n"
             , (int)pmatch[0].rm_so
             , av[1][pmatch[0].rm_so]
             , (int)pmatch[0].rm_eo
             , av[1][pmatch[0].rm_eo]
            );

  return (0);
}

outputs:

$./a.out "012345MAIL FROM:<abcd>$"
matched from 6 (M) to 22 ($)

solution:

As RarrRarrRarr said, the indices of the captured group are in pmatch[1].rm_so and pmatch[1].rm_eo (pmatch[0] always holds the whole match).
Hence regmatch_t pmatch[1]; becomes regmatch_t pmatch[2];
and regexec(&regex, av[1], 1, pmatch, 0); becomes regexec(&regex, av[1], 2, pmatch, 0);

Thanks :)

Answer

Ian Mackinnon · Aug 8, 2012

Here's a code example that demonstrates capturing multiple groups.

You can see that group '0' is the whole match, and subsequent groups are the parts within parentheses.

Note that this only captures the first match in the source string. Capturing groups in multiple matches requires calling regexec again on the rest of the string.

#include <stdio.h>
#include <regex.h>

int main (void)
{
  const char * source = "___ abc123def ___ ghi456 ___";
  const char * regexString = "[a-z]*([0-9]+)([a-z]*)";
  size_t maxGroups = 3;  // whole match + 2 parenthesized groups

  regex_t regexCompiled;
  regmatch_t groupArray[maxGroups];

  if (regcomp(&regexCompiled, regexString, REG_EXTENDED))
    {
      printf("Could not compile regular expression.\n");
      return 1;
    }

  if (regexec(&regexCompiled, source, maxGroups, groupArray, 0) == 0)
    {
      unsigned int g = 0;
      for (g = 0; g < maxGroups; g++)
        {
          // regoff_t is signed; -1 marks a group that took no part in the match
          if (groupArray[g].rm_so == -1)
            continue;

          // Print the group's slice of source directly via %.*s (no copy needed)
          printf("Group %u: [%2d-%2d]: %.*s\n",
                 g, (int)groupArray[g].rm_so, (int)groupArray[g].rm_eo,
                 (int)(groupArray[g].rm_eo - groupArray[g].rm_so),
                 source + groupArray[g].rm_so);
        }
    }

  regfree(&regexCompiled);

  return 0;
}

Output:

Group 0: [ 4-13]: abc123def
Group 1: [ 7-10]: 123
Group 2: [10-13]: def