I'm trying to match the following items in the string pcode:

  • u followed by a 1 or 2 digit number
  • phaseu
  • phasep
  • x (surrounded by non-word chars)
  • y (surrounded by non-word chars)
  • z (surrounded by non-word chars)

I've tried to implement a regex match using the POSIX regex functions (shown below), but have two problems:

  1. The compiled pattern seems to have no subpatterns (i.e. compiled.re_nsub == 0).
  2. The pattern doesn't find matches in the string " u0", which it really should!

I'm confident that the regex string itself is working, in that it works in Python and TextMate. My problem lies with the compilation and matching in C. Any help with getting that working would be much appreciated.

Thanks in advance for your answers.

if(idata=tb_find(deftb,pdata)){
    MESSAGE("Global variable!\n");
    char pattern[80] = "((u[0-9]{1,2})|(phaseu)|(phasep)|[\\W]+([xyz])[\\W]+)";
    MESSAGE("Pattern = \"%s\"\n",pattern);
    regex_t compiled;
    if(regcomp(&compiled, pattern, 0) == 0){
        MESSAGE("Compiled regular expression \"%s\".\n", pattern);
    }

    int nsub = compiled.re_nsub;
    MESSAGE("nsub = %d.\n",nsub);
    regmatch_t matchptr[nsub];
    int err;
    if(err = regexec (&compiled, pcode, nsub, matchptr, 0)){
        if(err == REG_NOMATCH){
            MESSAGE("Regular expression did not match.\n");
        }else if(err == REG_ESPACE){
            MESSAGE("Ran out of memory.\n");
        }
    }
    regfree(&compiled);
}
+8  A: 

It seems you intend to use something resembling the "extended" POSIX regex syntax. POSIX defines two different regex syntaxes, a "basic" (read "obsolete") syntax and the "extended" syntax. To use the extended syntax, you need to add the REG_EXTENDED flag for regcomp:

...
if(regcomp(&compiled, pattern, REG_EXTENDED) == 0){
...

Without this flag, regcomp will use the "basic" regex syntax. There are some important differences, such as:

  • No support for the | operator
  • The parentheses for submatches need to be escaped, \( and \)
  • The braces of a repetition count need to be escaped as well, \{ and \}

It should also be noted that the POSIX extended regex syntax is not 1:1 compatible with Python's regex (I don't know about TextMate). In particular, I'm afraid this part of your regexp does not work in POSIX, or at least is not portable:

 [\\W]

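Inside a POSIX bracket expression the backslash is an ordinary character, so this matches a backslash or a W. The POSIX way to specify non-word characters (anything other than a letter, a digit or an underscore) is:

 [^[:alnum:]_]

Your whole regexp for POSIX should then look like this in C:

 char *pattern = "((u[0-9]{1,2})|(phaseu)|(phasep)|[^[:alnum:]_]+([xyz])[^[:alnum:]_]+)";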
Ville Laurikari
Thanks Ville! That did the trick beautifully. Can you tell me if there is an equivalent for OR (|), or should I just compile and match multiple expressions?
rossmcf
The POSIX extended syntax has support for |. I've edited my post to include a regex which should do what you need (as long as you use REG_EXTENDED).
Ville Laurikari
You star! You've saved me from an afternoon of swearing and desk-thumping… Much appreciated.
rossmcf