tags:

views:

130

answers:

1

Hi,

I'm trying to extract a string from another using regex. I'm using the POSIX regex functions (regcomp, regexec ...), and I fail at capturing a group ...

For instance, let the pattern be something as simple as "MAIL FROM:<(.*)>"
(with REG_EXTENDED cflags)

I want to capture everything between '<' and '>'

My problem is that regmatch_t gives me the boundaries of the whole pattern (MAIL FROM:<...>) instead of just what's between the parenthesis ...

What am I missing ?

Thanks in advance,

edit: some code

#define SENDER_REGEX "MAIL FROM:<(.*)>"

int main(int ac, char **av)
{
  regex_t regex;
  int status;
  regmatch_t pmatch[1];

  if (regcomp(&regex, SENDER_REGEX, REG_ICASE|REG_EXTENDED) != 0)
    printf("regcomp error\n");
  status = regexec(&regex, av[1], 1, pmatch, 0);
  regfree(&regex);
  if (!status)
      printf(  "matched from %d (%c) to %d (%c)\n"
             , pmatch[0].rm_so
             , av[1][pmatch[0].rm_so]
             , pmatch[0].rm_eo
             , av[1][pmatch[0].rm_eo]
            );

  return (0);
}

outputs:

$./a.out "012345MAIL FROM:<abcd>$"
matched from 6 (M) to 22 ($)

solution:

as RarrRarrRarr said, the indices are indeed in pmatch[1].rm_so and pmatch[1].rm_eo
hence regmatch_t pmatch[1]; becomes regmatch_t pmatch[2];
and regexec(&regex, av[1], 1, pmatch, 0); becomes regexec(&regex, av[1], 2, pmatch, 0);

Thanks :)

+2  A: 

The 0th element of the pmatch array of regmatch_t structs will contain the boundaries of the whole string matched, as you have noticed. In your example, you are interested in the regmatch_t at index 1, not at index 0, in order to get information about the string matches by the subexpression.

If you need more help, try editing your question to include an actual small code sample so that people can more easily spot the problem.

RarrRarrRarr
Thanks, that was the issue :)
Sylvain