Hello, I have a string (char *) in C and I'm using sscanf to tokenize it.

I'm generating C source code, and using sscanf is the easiest solution; however, there is a problem:

A parameter name must match this regular expression:

[$]([a-zA-Z0-9_-]{0,122})?[a-zA-Z0-9]

(It starts with $ and can contain digits, letters, '-' and '_', but the latter two cannot be at the end of the parameter name.)

For example:

$My_parameter1            //OK
$my-param-2               //OK
$_-this_-_is--my-_par     //OK
$My_parameter2-           //WRONG!
$My_parameter2_           //WRONG!

The problem is this (simplified):

char _param1 [125]; //string that matches parameter name
char _param2 [125]; //string that matches parameter name

if ( sscanf(str, " $%124[a-zA-Z0-9_-] - $%124[a-zA-Z0-9_-] ", _param1, _param2) != 2 )
    DO_FAIL;

When used on " $parameter_one - $param-two " it works (clearly).

The problem is obviously with "$param1-$param2", because sscanf tokenizes the first item as '$param1-' and then fails to find the '-'.

Can an experienced C programmer see a simple way to solve this?

For example:

char _param1 [125]; //string that matches parameter name
char _param2 [125]; //string that matches parameter name

??? ... ???    
sscanf("$my-param1-$my-param2", ??? ... ???)
??? ... ???

// _param1 == "$my-param1"     //resp. strcmp(_param1, "$my-param1") == 0
// _param2 == "$my-param2"

Thanks...

A: 

sscanf() doesn't support regular expressions?

Oli Charlesworth
No, it doesn't.
DinGODzilla
+2  A: 

Short answer: This is not solvable with sscanf, because sscanf cannot backtrack.

At least you cannot do it with just one sscanf call. Try something like

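int _consumed = 0;   // number of characters of str consumed so far (set by %n)
if (sscanf(str, " $%124[a-zA-Z0-9_-]%n", _param1, &_consumed) != 1) DO_FAIL;
size_t _param1_len = strlen(_param1);
if (_param1[_param1_len-1] == '-') {
  _param1[_param1_len-1] = '\0';
  _param1_len -= 1;
  _consumed -= 1;    // give the trailing '-' back to the input
}
// parse rest '- $param2'
if (sscanf(str+_consumed, ...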

The idea is to parse just one token at a time. You could implement identifier parsing as its own function so you can reuse it, as you probably also want to parse something like "$foo + $bar".
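
One way to sketch the "one token at a time" idea as a reusable function is shown here. The names parse_param and parse_pair are made up for illustration, and the sketch assumes that each output buffer holds at least 125 bytes and that only spaces or tabs may surround the '-'. As in the question's own sscanf call, the '$' is matched by the format and is not stored in the buffer. Unlike the snippet above, it also gives back a trailing '_'.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one "$name" token from s into out (at least
 * 125 bytes). Returns a pointer just past the token, or NULL on failure. */
static const char *parse_param(const char *s, char *out)
{
    int n = 0;
    size_t len;

    /* " $" skips leading whitespace and matches the '$'; %n stores the
     * number of characters consumed up to that point. */
    if (sscanf(s, " $%124[a-zA-Z0-9_-]%n", out, &n) != 1)
        return NULL;

    /* sscanf cannot backtrack, so give back trailing '-' and '_' by hand:
     * they cannot end a parameter name. */
    len = strlen(out);
    while (len > 0 && (out[len - 1] == '-' || out[len - 1] == '_')) {
        out[--len] = '\0';
        --n;
    }
    return len > 0 ? s + n : NULL;
}

/* Hypothetical helper: parses "$a - $b" with optional blanks around '-'.
 * Returns 1 on success, 0 otherwise. */
static int parse_pair(const char *s, char *p1, char *p2)
{
    const char *rest = parse_param(s, p1);
    if (rest == NULL)
        return 0;
    rest += strspn(rest, " \t");          /* blanks before the separator */
    if (*rest != '-')
        return 0;
    rest = parse_param(rest + 1, p2);     /* skips blanks after the '-' */
    if (rest == NULL)
        return 0;
    rest += strspn(rest, " \t");          /* trailing blanks */
    return *rest == '\0';
}
```

With this, parse_pair("$my-param1-$my-param2", _param1, _param2) should succeed and leave "my-param1" and "my-param2" in the two buffers.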

phadej
A: 

I think there is no easy way to do this with sscanf. sscanf is NO replacement for regexps. A hand-written solution should be shorter here:

char *t,input[]="$my-param1-$my-param2";
if( (t=strstr(input,"-$"))!=0 || (t=strstr(input,"_$"))!=0 )
{
  *t=0;
  strcpy(param1,input);
  strcpy(param2,t+1);
}

OK, with spaces between the tokens it's also easy:

char *t,*t1,input[]=" $my-param1 -  $my-param2 ";
if( (t=strchr(input,'$'))!=0 && (t1=strchr(t+2,'$'))!=0 )
{
  *--t1=0;
  while( t1>t+2 && strchr(" -_",*(t1-1)) )
    *--t1=0;
  while( !*t1 ) ++t1;
  while( *t1 && strchr(" -_",t1[strlen(t1)-1]) )
    t1[strlen(t1)-1]=0;
  strcpy(param1,t);
  strcpy(param2,t1);
}
Thanks. The problem is with "$par1 - $par2", "$par1- $par2", "$par1 - $par2", and so on. (There are more spaces; I hope the comment doesn't strip them.)
DinGODzilla
Well, it DID trim them; imagine more spaces before/after the '-'.
DinGODzilla
+1  A: 

You appear to be familiar with regular expressions. If you are on a POSIX platform, why not use regcomp()/regexec()/regfree()? Or PCRE, which is also available as a DLL for Windows?

I generally avoid using sscanf() for anything more complicated than reading numbers or strings. Otherwise I either code a mini FSM (consuming the string char by char) or use regular expressions.
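
A minimal sketch of the regcomp()/regexec()/regfree() route for the "$a - $b" case from the question follows. The names PARAM, PAIR_PATTERN, copy_group and match_pair are invented for the example, and the pattern assumes that only spaces may surround the '-'. The parameter part of the pattern is the question's regular expression with the optional group flattened. Because the pattern forces each name to end in a letter or digit, the regex engine does the backtracking that sscanf cannot. Here the captured names include the '$'.

```c
#include <regex.h>
#include <string.h>

/* One parameter: '$', up to 122 name characters, then a final alphanumeric. */
#define PARAM "([$][a-zA-Z0-9_-]{0,122}[a-zA-Z0-9])"

/* Whole input: param, '-' with optional spaces around it, param. */
static const char *PAIR_PATTERN = "^ *" PARAM " *- *" PARAM " *$";

/* Hypothetical helper: copies one capture group of s into out (size bytes). */
static void copy_group(const char *s, const regmatch_t *m, char *out, size_t size)
{
    size_t len = (size_t)(m->rm_eo - m->rm_so);
    if (len >= size)
        len = size - 1;                 /* truncate rather than overflow */
    memcpy(out, s + m->rm_so, len);
    out[len] = '\0';
}

/* Hypothetical helper: returns 1 and fills p1/p2 if s looks like "$a - $b". */
static int match_pair(const char *s, char *p1, char *p2, size_t size)
{
    regex_t re;
    regmatch_t m[3];                    /* whole match plus two groups */
    int ok;

    /* Compiled on every call only to keep the sketch short; real code
     * would compile the pattern once and reuse it. */
    if (regcomp(&re, PAIR_PATTERN, REG_EXTENDED) != 0)
        return 0;
    ok = (regexec(&re, s, 3, m, 0) == 0);
    if (ok) {
        copy_group(s, &m[1], p1, size);
        copy_group(s, &m[2], p2, size);
    }
    regfree(&re);
    return ok;
}
```

A call such as match_pair("$my-param1-$my-param2", _param1, _param2, sizeof _param1) should then yield "$my-param1" and "$my-param2".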

Dummy00001
Thanks, Dummy00001. I'm on Linux (GCC), using regcomp/regexec/regfree to validate that I'm parsing the right expression. PCRE cannot be used (the client doesn't allow it). C really sucks at regexps. The problem is that this is generated C source (generated in Groovy): more than 25k lines (~2MB of sources). It must be fast, and sscanf is the best simple tool for that.
DinGODzilla
The task IMO is simply too complicated to fit into the simplistic semantics of `sscanf()`. That's the main point of my comment. If performance is a goal, I would have coded a generator of simple FSMs. Switch/case-based FSMs are rather easy to generate and test. And BTW, POSIX regexps are not that slow. A simple regexp would be only a tad slower than `sscanf()`.
Dummy00001
I have tested regexec vs. sscanf performance on my Linux box. 1 million `sscanf("%[^ ] %[^ ]")` calls: ~270ms; `regcomp( "^[^ ]\\+ [^ ]\\+$" ) + regexec()`: ~410ms. And that's for 1'000'000 matches. Totally not a performance problem.
Dummy00001
A: 

It seems using sscanf is not the easiest solution, because sscanf alone will not cope with such tokens.

However, parsing such a string character by character is very simple.

You need a function which will look ahead and tell where the token ends:

char *token_end(char *s)
{
    int specials = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '_' || *s == '-')
            ++specials;
        else if (isalnum((unsigned char)*s))   /* needs <ctype.h> */
            specials = 0;
        else
            break;
    }
    return s - specials;
}

It is passed a pointer to the first character after a found '$' and returns a pointer to the first character after the token.

Now, parse the string character by character. If it's a '$', use token_end to find where the token ends, and continue from its end; otherwise, the character doesn't belong to a token:

/* str is a pointer to a NUL-terminated string */
char *p = str;
while (*p != '\0') {
    if (*p == '$') {
        char *beg = p;
        char *end = token_end(p+1);
        if (end - beg > 1) {
            /* here, beg points to the '$' of the current token,
             * and end to the character just after the token */
            printf("token(%td,%td)", beg - str, end - str);
            /* parse the token, save it, etc... */
            p = end;
            continue;
        }
    }
    /* do something with a character which does not belong to a token... */
    printf("%c", *p);
    ++p;
}
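
To connect this back to the "$a - $b" input from the question, here is a rough sketch that uses the same look-ahead to split such a string into two buffers. token_end is repeated (with const added) only so the block compiles on its own; take_param and split_pair are hypothetical names, and the sketch assumes that only spaces may surround the '-'. The copied names include the '$'.

```c
#include <ctype.h>
#include <string.h>

/* Same logic as token_end above, repeated so the sketch is self-contained. */
static const char *token_end(const char *s)
{
    int specials = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '_' || *s == '-')
            ++specials;
        else if (isalnum((unsigned char)*s))
            specials = 0;
        else
            break;
    }
    return s - specials;
}

/* Hypothetical helper: copies the token starting at the '$' in *pp into out
 * (size bytes, '$' included) and advances *pp past it. Returns 1 on success. */
static int take_param(const char **pp, char *out, size_t size)
{
    const char *beg = *pp;
    const char *end;
    size_t len;

    if (*beg != '$')
        return 0;
    end = token_end(beg + 1);
    len = (size_t)(end - beg);
    if (len < 2 || len >= size)         /* empty name or buffer too small */
        return 0;
    memcpy(out, beg, len);
    out[len] = '\0';
    *pp = end;
    return 1;
}

/* Hypothetical helper: splits "$a - $b" into p1 and p2. Returns 1 on success. */
static int split_pair(const char *s, char *p1, char *p2, size_t size)
{
    s += strspn(s, " ");
    if (!take_param(&s, p1, size))
        return 0;
    s += strspn(s, " ");
    if (*s++ != '-')                    /* the separator is mandatory */
        return 0;
    s += strspn(s, " ");
    if (!take_param(&s, p2, size))
        return 0;
    s += strspn(s, " ");
    return *s == '\0';
}
```

For "$my-param1-$my-param2" the first token_end call stops at the '$' of the second parameter and steps back over the single trailing '-', so the separator is left in the input for split_pair to consume.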
Michał Trybus