+1  Q: 

Filter string in C

How can I filter a string in C? I want to remove anything that isn't [a-z0-9_].

int main(int argc, char ** argv) {
   char* name = argv[1];
   // remove anything that isn't [a-z0-9_]

   printf("%s", name);
}
A: 

The C standard library doesn't provide any support for regular expressions.
You'll either need a regex library for C (a very common one is PCRE), or do this in a loop. The loop is easier in the case at hand: the pattern only ever matches single characters, so no backtracking is needed.

The loop approach would look something like:

#include <ctype.h>
#include <stdio.h>

int main(int argc, char ** argv) {
   if (argc < 2)
      return 1;  // nothing to filter

   char* name = argv[1];

   // remove anything that isn't [a-z0-9_]
   char strippedName[200];
   size_t iIn, iOut;  // subscripts into name and strippedName respectively

   iIn = iOut = 0;
   // stop one short of the buffer size to leave room for the terminating '\0'
   while (name[iIn] != '\0' && iOut < sizeof(strippedName) - 1) {
      // some condition defining a desirable character
      // BTW, this condition should actually be
      //    if (islower(name[iIn]) || isdigit(name[iIn]) || name[iIn] == '_')
      // to match the OP's requirement exactly
      // (the cast avoids undefined behavior for negative char values)
      if (isalnum((unsigned char)name[iIn]) || name[iIn] == '_')
         strippedName[iOut++] = name[iIn];
      iIn++;
   }
   strippedName[iOut] = '\0';

   printf("%s", strippedName);
   return 0;
}

Additional regular expression libraries for C (other than PCRE, mentioned earlier):

mjv
A regex engine is always good to have in your repository but I suspect it's like trying to kill a fly with a rocket launcher in this case :-)
paxdiablo
@paxdiablo: agreed; being unsure of the context of the OP's question, I listed both.
mjv
Thinking about the OP's context, it sounds more and more like homework... I wish I had responded less directly (or ignored it altogether).
mjv
A: 

Take a look at `isalnum` (declared in `ctype.h`).

rerun
A: 

Check out ctype for functions to test each character in a loop.

C.D. Reimer
`cctype` is a C++ header, but the question is tagged as `c`, so the header he should be using is `ctype.h`.
dreamlax
@dreamlax: Corrected misspelling. However, if you look closely at the web page, it says cctype (ctype.h). They're the same thing.
C.D. Reimer
@C.D. Reimer: They're not the same thing because you can't include `cctype` using a C compiler. The `cctype` header may contain syntax specific to C++; and on my system this is the case. In fact, my `ctype.h` header also defines one more function (`isblank`, introduced in C99) over my `cctype` header.
dreamlax
@dreamlax: I put together a C program using that web page as reference for ctype.h and I didn't have any problems with it.
C.D. Reimer
@C.D. Reimer: That's because the functions that the headers define behave identically, but that doesn't mean that the headers themselves are identical; in my case, I have one extra function in my `ctype.h` header that is not available in `cctype`.
dreamlax
+1  A: 
char *src, *dst;
for (src = name, dst = name; *src; src++) {
   if ('a' <= *src && *src <= 'z' 
    || '0' <= *src && *src <= '9' 
    || *src == '_') *dst++ = *src;
}
*dst = '\0';

EDIT: Multiple small revisions. I hope to have the bugs out now.

Carl Smotricz
Not a problem for the vast majority of the world's computers, but the C standard in no way mandates that a-z are contiguous characters.
paxdiablo
True enough. I guess the really safe thing to do in that case would be to build an array of 256 'booleans' (like what's in ctype) with 'true' set for exactly the chars wanted, and to use that to do the check. Or better yet, use `islower()` and `isdigit()` like caf did. His solution really is better.
Carl Smotricz
I went with this one except I used `islower` and `isdigit` instead of the range test. Thanks!
Paul Tarjan
Nor does the standard mandate 256 chars, but now I'm just being a pedantic a*hole :-)
paxdiablo
@Paul: Thank you! @pax: right again. I guess that makes a good reason to use those macros. OTOH, if the OP's spec had not fit a pre-made char type, then we'd really be hurting. I just looked into `ctype.h` and didn't enjoy what I saw. For a performance hit, one could of course also use `strchr()` to look up the char in a list represented by a string.
Carl Smotricz
+1  A: 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char ** argv)
{    
    char *name, *inp, *outp;

    if (argc < 2)
    {
        fprintf(stderr, "Insufficient arguments.\n");
        return 1;
    }

    inp = argv[1];
    name = malloc(strlen(inp) + 1);
    outp = name;

    if (!name)
    {
        fprintf(stderr, "Out of memory.\n");
        return 2;
    }

    while (*inp)
    {
        if (islower((unsigned char)*inp) || isdigit((unsigned char)*inp) || *inp == '_')
            *outp++ = *inp;
        inp++;
    }

    *outp = '\0';

    puts(name);
    free(name);

    return 0;
}
caf
any reason to make a new string instead of doing it inplace?
Paul Tarjan
Not really, modifying `*argv` just seems a little crass ;)
caf
Why do you have to modify anything? Just examine each character of input and output only the valid ones.
dreamlax
A: 

Try Oniguruma regex library

S.Mark
Same comment as for mjv. I don't see this particular problem as being complex enough to justify a full-blown regex engine. I don't include all of SQLite just so I can store a couple of persistent items to disk - far better to just use fprintf.
paxdiablo
Agree with pax. This is sorta overkill
ItzWarty
+1  A: 

If you just want to strip those unwanted characters out of the first argument, there's no need for memory allocation, just walk through the input string character-by-character. And, if you know you'll be working in an ASCII environment (or any other that supports contiguous a through z), you could even replace the function calls with faster versions checking the character ranges.

But, I can't see the increase in speed as being enough to justify non-portable code.

#include <stdio.h>
#include <ctype.h>
int main(int argc, char ** argv) {
    char *p;
    if (argc > 1) {
        for (p = argv[1]; *p != '\0'; p++) {
           /* cast: the ctype functions need a value representable as unsigned char */
           unsigned char c = (unsigned char)*p;
           if (islower(c) || isdigit(c) || c == '_') {
               putchar (c);
           }
        }
        putchar ('\n');
    }
    return 0;
}
paxdiablo
I like your loop except for the putchar. I'll put it back in the original string.
Paul Tarjan