ansaurus

Question

C :: using sscanf to parse string, caveat: context sensitive :-(

Answer 1

A:

sscanf() doesn't support regular expressions?

Oli Charlesworth 2010-08-24 10:49:40

No, it doesn't.

DinGODzilla 2010-08-24 11:54:58

Answer 2

+2 A:

Short answer: This is not solvable with sscanf, because sscanf cannot backtrack.

At least you cannot do it with just one sscanf call. Try something like

if (sscanf(str, " $%124[a-zA-Z0-9_-]", _param1) != 1) DO_FAIL;
size_t _param1_len = strlen(_param1);
if (_param1[_param1_len-1] == '-') {
  _param1[_param1_len-1] = '\0';
  _param1_len -= 1;
}
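// parse rest '- $param2'
// (str+_param1_len ignores the leading whitespace and '$' consumed above; add their length to the offset)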
if (sscanf(str+_param1_len, ...

The idea is to parse just one token at a time. You could implement identifier parsing as its own function so you can reuse it, as you probably want to parse something that looks like "$foo + $bar".
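A minimal sketch of that reusable function, assuming tokens never need to end in '-' and that the separator is a single '-' with optional spaces around it. The names parse_identifier and parse_pair are made up for illustration. It uses the %n conversion so the offset into the string also covers the skipped whitespace and the '$':

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one "$identifier" from s into out (at least
 * 125 bytes, stored without the '$'). Returns the number of characters
 * consumed from s, or -1 if no identifier was found. */
static int parse_identifier(const char *s, char *out)
{
    int consumed = 0;

    /* %n stores the number of characters read so far; it is not counted
     * in the return value, so success is still 1. */
    if (sscanf(s, " $%124[a-zA-Z0-9_-]%n", out, &consumed) != 1)
        return -1;

    /* The scanset is greedy and swallows a trailing separator '-'.
     * Strip it and hand those characters back to the caller. */
    size_t len = strlen(out);
    while (len > 0 && out[len - 1] == '-') {
        out[--len] = '\0';
        --consumed;
    }
    return len > 0 ? consumed : -1;
}

/* Hypothetical helper: splits "$a - $b" into a and b.
 * Returns 0 on success, -1 on a parse error. */
static int parse_pair(const char *s, char *p1, char *p2)
{
    int n = parse_identifier(s, p1);
    if (n < 0)
        return -1;
    s += n;

    /* Separator: optional whitespace, then '-'. n stays 0 when the
     * literal '-' does not match. */
    n = 0;
    (void)sscanf(s, " -%n", &n);
    if (n == 0)
        return -1;
    s += n;

    return parse_identifier(s, p2) < 0 ? -1 : 0;
}
```

Called as parse_pair("$my-param1-$my-param2", a, b), this should leave "my-param1" in a and "my-param2" in b, and the same holds with any number of spaces around the middle '-'.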

phadej 2010-08-24 10:51:30

Answer 3

A:

I think there is no easy way to do this with sscanf; sscanf is NO replacement for regexps. A hand-rolled solution should be shorter here:

char *t,input[]="$my-param1-$my-param2";
if( (t=strstr(input,"-$"))!=0 || (t=strstr(input,"_$"))!=0 )
{
  *t=0;
  strcpy(param1,input);
  strcpy(param2,t+1);
}

OK, with spaces between tokens it's also easy:

char *t,*t1,input[]=" $my-param1 -  $my-param2 ";
if( (t=strchr(input,'$'))!=0 && (t1=strchr(t+2,'$'))!=0 )
{
  *--t1=0;
  while( t1>t+2 && strchr(" -_",*(t1-1)) )
    *--t1=0;
  while( !*t1 ) ++t1;
  while( *t1 && strchr(" -_",t1[strlen(t1)-1]) )
    t1[strlen(t1)-1]=0;
  strcpy(param1,t);
  strcpy(param2,t1);
}

2010-08-24 10:51:54

Thanks. The problem is with "$par1 - $par2", "$par1- $par2", "$par1 - $par2", and so on (there are more spaces; I hope the comment doesn't strip them).

DinGODzilla 2010-08-24 11:56:10

Well, it DID trim them; imagine more spaces before/after '-'.

DinGODzilla 2010-08-24 11:57:46

Answer 4

+1 A:

You appear to be familiar with regular expressions. If you are on a POSIX platform, why not use regcomp()/regexec()/regfree()? Or PCRE, which is also available as a DLL for Windows?

I generally avoid using sscanf() for anything more complicated than reading numbers or strings. Otherwise I either code a mini FSM (consuming the string char by char) or use regular expressions.
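For illustration, a rough sketch of the POSIX route. The grammar here is an assumption, since the question does not spell it out: a token is '$' followed by alphanumeric runs joined by single '-' or '_' characters, and two tokens are separated by '-' with optional spaces. The helper name split_params is hypothetical:

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the two "$name" tokens of s into p1 and p2
 * (each with room for cap bytes). Returns 0 on success, -1 otherwise. */
static int split_params(const char *s, char *p1, char *p2, size_t cap)
{
    /* Assumed grammar, written as a POSIX extended regular expression.
     * Group 1 is the first token, group 3 the second one. */
    static const char *pattern =
        "^ *(\\$[A-Za-z0-9]+([_-][A-Za-z0-9]+)*) *- *"
        "(\\$[A-Za-z0-9]+([_-][A-Za-z0-9]+)*) *$";
    regex_t re;
    regmatch_t m[5];
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, s, 5, m, 0) == 0) {
        size_t n1 = (size_t)(m[1].rm_eo - m[1].rm_so);
        size_t n2 = (size_t)(m[3].rm_eo - m[3].rm_so);
        if (n1 < cap && n2 < cap) {
            memcpy(p1, s + m[1].rm_so, n1);
            p1[n1] = '\0';
            memcpy(p2, s + m[3].rm_so, n2);
            p2[n2] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

In real code you would probably call regcomp() once and reuse the compiled regex_t for every input line, since compiling is the expensive part.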

Dummy00001 2010-08-24 15:29:21

Thanks, Dummy00001. I'm on Linux (GCC), using regcomp/regexec/regfree to validate that I'm parsing the right expression. PCRE cannot be used (the client doesn't allow it). C really sucks at regexps. The problem is that it's generated C source (generated in Groovy): more than 25k lines (~2MB of sources). It must be fast, and sscanf is the best simple tool for that.

DinGODzilla 2010-08-24 16:02:04

The task IMO is simply too complicated to fit into the simplistic semantics of `sscanf()`. That's the main point of my comment. If performance is a goal, I would have coded a generator of simple FSMs. Switch/case-based FSMs are rather easy to generate and test. And BTW, POSIX regexps are not that slow. A simple regexp would be only a tad slower than `sscanf()`.

Dummy00001 2010-08-24 16:23:31

I have tested regexec v. sscanf performance on my Linux. 1Mln `sscanfs("%[^ ] %[^ ]")` ~270ms; `regcomp( "^[^ ]\\+ [^ ]\\+$" ) + regexec()` ~ 410ms. And that's for 1'000'000 matches. Totally not a performance problem.

Dummy00001 2010-08-24 16:46:23

Answer 5

A:

It seems using sscanf is not the easiest solution, because sscanf alone will not cope with such tokens.

However, parsing such a string character by character is very simple.

You need a function which will look ahead and tell where the token ends:

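#include <ctype.h>

char *token_end(char *s)
{
    int specials = 0;  /* '_' / '-' seen since the last alphanumeric */
    for (; *s != '\0'; ++s) {
        if (*s == '_' || *s == '-')
            ++specials;
        else if (isalnum((unsigned char)*s))
            specials = 0;
        else
            break;
    }
    /* trailing specials are separators, not part of the token */
    return s - specials;
}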

It is passed a pointer to the first character after a found '$' and returns a pointer to the first character after the token.

Now, parse the string character by character. If it's a '$', use token_end to find where the token ends, and continue from its end; otherwise, the character doesn't belong to a token:

/* str is a pointer to a NUL-terminated string */
char *p = str;
while (*p != '\0') {
    if (*p == '$') {
        char *beg = p;
        char *end = token_end(p+1);
        if (end - beg > 1) {
            /* here, beg points to the '$' of the current token,
             * and end to the character just after the token */
            printf("token(%td,%td)", beg - str, end - str);
            /* parse the token, save it, etc... */
            p = end;
            continue;
        }
    }
    /* do something with a character which does not belong to a token... */
    printf("%c", *p);
    ++p;
}
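If you want the tokens themselves rather than printed offsets, the loop above could be wrapped roughly like this. It is only a sketch: collect_tokens and TOKEN_CAP are hypothetical names, and token_end is repeated (with const pointers) so the snippet compiles on its own:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_CAP 64  /* assumed maximum token size, including the NUL */

/* Same look-ahead as token_end above, repeated here with const pointers. */
static const char *token_end(const char *s)
{
    int specials = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '_' || *s == '-')
            ++specials;
        else if (isalnum((unsigned char)*s))
            specials = 0;
        else
            break;
    }
    return s - specials;
}

/* Hypothetical helper: copies up to max tokens (each including its '$')
 * from str into out and returns how many were found. Tokens longer than
 * TOKEN_CAP - 1 characters are truncated. */
static size_t collect_tokens(const char *str, char out[][TOKEN_CAP], size_t max)
{
    size_t count = 0;
    const char *p = str;

    while (*p != '\0' && count < max) {
        if (*p == '$') {
            const char *end = token_end(p + 1);
            size_t len = (size_t)(end - p);
            if (len > 1) {              /* more than just the '$' */
                if (len >= TOKEN_CAP)
                    len = TOKEN_CAP - 1;
                memcpy(out[count], p, len);
                out[count][len] = '\0';
                ++count;
                p = end;
                continue;
            }
        }
        ++p;  /* character outside any token: skip it */
    }
    return count;
}
```

For "$my-param1-$my-param2" this should yield the two tokens "$my-param1" and "$my-param2", because the '-' in front of the second '$' is counted as a trailing special and left out of the first token.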

Michał Trybus 2010-08-24 17:02:27

Problem is this (simplified):
