ansaurus

Question

How to remove punctuation from a String in C

Answer 1

+11 A:

Loop over the characters of the string. Whenever you meet a punctuation character (ispunct), don't copy it to the output string. Whenever you meet an "alpha char" (isalpha), use tolower to convert it to lowercase.

All the mentioned functions are defined in <ctype.h>

You can either do it in-place (by keeping separate write pointers and read pointers to the string), or create a new string from it. But this entirely depends on your application.
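To illustrate the "create a new string" option mentioned above, here is a minimal sketch. The function name strip_punct_copy and the choice to allocate with malloc are assumptions made for this example, not part of the original answer.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `src` with
 * punctuation removed and letters lowercased, or NULL if allocation
 * fails. The caller owns the result and must free() it. */
char *strip_punct_copy(const char *src)
{
    /* The result can never be longer than the input. */
    char *out = malloc(strlen(src) + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;

    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;
        if (!ispunct(c))
            *dst++ = (char)tolower(c);
    }
    *dst = '\0';
    return out;
}
```

With this sketch, strip_punct_copy("Hello, World!") yields "hello world", and the input string is left untouched.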

Eli Bendersky 2009-12-03 18:05:08

more detail here: http://stackoverflow.com/questions/421616/best-way-to-strip-punctuation-from-a-string

TStamper 2009-12-03 18:08:25

TStamper, there seems to be no C example there! C#, C++, but no C

Eli Bendersky 2009-12-03 18:10:24

@eliben - I meant more detail in terms of examples, not anything language-specific

TStamper 2009-12-03 18:12:30

By the time I wrote a full compilable program and tested it, there was already another answer doing basically the same thing. So, I deleted mine and upvoted @asveikau's answer.

Sinan Ünür 2009-12-03 18:19:28

SO... you snooze - you lose, survival of the fit^H^Hastest. :-)

Eli Bendersky 2009-12-03 18:20:52

Answer 2

+8 A:

Just a sketch of an algorithm using functions provided by ctype.h:

#include <ctype.h>

void remove_punct_and_make_lower_case(char *p)
{
    char *src = p, *dst = p;

    while (*src)
    {
       if (ispunct((unsigned char)*src))
       {
          /* Skip this character */
          src++;
       }
       else if (isupper((unsigned char)*src))
       {
          /* Make it lowercase */
          *dst++ = tolower((unsigned char)*src);
          src++;
       }
       else if (src == dst)
       {
          /* Increment both pointers without copying */
          src++;
          dst++;
       }
       else
       {
          /* Copy character */
          *dst++ = *src++;
       }
    }

    *dst = 0;
}

Standard caveats apply: Completely untested; refinements and optimizations left as exercise to the reader.

asveikau 2009-12-03 18:08:49

Don't forget to add that '\0' in the end !!

Eli Bendersky 2009-12-03 18:12:19

Nice catch. Fixed.

asveikau 2009-12-03 18:13:21

You should cast the argument of the `is*` or `to*` functions to `unsigned char`. That is not a refinement or optimization!

pmg 2009-12-03 18:14:35
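Some background on the cast, as a hedged illustration: the `<ctype.h>` functions take an int whose value must be representable as an unsigned char (or be EOF). On platforms where plain char is signed, a byte of 0x80 or above becomes negative, and passing it directly is undefined behavior. One way to keep the cast in a single place is a pair of small wrappers; the names is_punct_char and to_lower_char are hypothetical.

```c
#include <ctype.h>

/* Hypothetical wrappers: accept a plain char and do the
 * unsigned char cast in exactly one place. */
int is_punct_char(char c)
{
    return ispunct((unsigned char)c) != 0;
}

char to_lower_char(char c)
{
    /* tolower returns its argument unchanged when there is
     * no lowercase mapping, so non-letters pass through. */
    return (char)tolower((unsigned char)c);
}
```

Whether a byte outside the ASCII range counts as punctuation depends on the current locale, so this sketch makes no claim about such values beyond being safe to call.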

@pmg Or I could say this is restricted to ASCII strings. Like I say, it's a sketch of an algorithm. :-) At any rate, I was going to update it but it looks like Sinan beat me to it. Thanks guys.

asveikau 2009-12-03 18:18:44

Since tolower() is usually implemented as a macro, you want to take the post-increment operator out of there, otherwise you'll have some nasty side-effects.

Ferruccio 2009-12-03 20:32:51

The second else if clause should be: else if (*src == *dst). But you could actually take it out completely and let the final else just copy matching characters.

Ferruccio 2009-12-03 20:35:42

@Ferruccio Very good point about the macro. I remember reading that in a manpage for something in ctype, once, actually. Fixed. As for the last else if... it's kind of a micro-optimization anyway. My original solution copied matching characters.

asveikau 2009-12-03 21:16:52

@asveikau - yes, but the else clause is comparing the pointers, not the characters they point to.

Ferruccio 2009-12-03 21:53:08

@Ferruccio yes, I was not thinking about it from the point of view of equal characters at different addresses. I was thinking about the fact that if the addresses are equal you don't need to "move" characters. At any rate the whole thing is a bit of a micro-optimization, though my guess is comparing pointers is faster than dereferencing bytes and comparing the values. :P

asveikau 2009-12-03 23:09:23

Answer 3

+4 A:

The idiomatic way to do this in C is to have two pointers, a source and a destination, and to process each character individually: e.g.

#include <ctype.h>

void reformat_string(char *src, char *dst) {
    for (; *src; ++src)
        if (!ispunct((unsigned char) *src))
            *dst++ = tolower((unsigned char) *src);
    *dst = 0;
}

src and dst can be the same string since the destination will never be larger than the source.

Although it's tempting, avoid calling tolower(*src++) since tolower may be implemented as a macro.
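To show what can go wrong, here is a small sketch built around a deliberately unsafe, hand-written macro (UNSAFE_TOLOWER is hypothetical and assumes an ASCII-style character set). ISO C requires library functions that are implemented as macros to evaluate each argument exactly once, so a conforming tolower should not misbehave this way; the caution applies to legacy or home-grown macros, and keeping side effects out of the argument costs nothing.

```c
/* Hypothetical, deliberately unsafe macro: its argument appears three
 * times in the expansion, so it can be evaluated more than once. */
#define UNSAFE_TOLOWER(c) (((c) >= 'A' && (c) <= 'Z') ? (c) - 'A' + 'a' : (c))

/* Helper state used only to count evaluations of the macro argument. */
static int reads;
static int read_once(int c) { reads++; return c; }

/* Returns how many times the macro evaluated its argument for `c`. */
int count_macro_evaluations(int c)
{
    reads = 0;
    (void)UNSAFE_TOLOWER(read_once(c));
    return reads;
}

/* Safe pattern: read the character once, advance the pointer separately. */
void lower_in_place(char *s)
{
    for (; *s != '\0'; ++s) {
        char c = *s;                   /* single read, no side effects */
        *s = (char)UNSAFE_TOLOWER(c);
    }
}
```

If the argument were `*src++` instead of a plain variable, each of those evaluations would advance the pointer, skipping characters.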

Avoid solutions that search for characters to replace (using strchr or similar); they can turn a linear algorithm into a geometric, O(n^2), one.

Ferruccio 2009-12-03 18:15:11

Arguments to `ctype.h` functions must be cast to `unsigned char`.

Sinan Ünür 2009-12-03 18:20:58

The argument to the `is*` and `to*` function should be cast to `unsigned char`.

pmg 2009-12-03 18:22:58

thanks, it's been a long time since I've written production C code

Ferruccio 2009-12-03 18:24:08

Searching for characters to replace will not make the algorithm exponential. Perhaps you are thinking it will change O(n) to O(n^2). This would be a geometric algorithm, not exponential (O(2^n)). But unless the set of characters to be replaced depends on the input in some way, the searching version will only multiply the algorithm's time by some constant (the number of such characters), which is still O(n) (though, obviously, a much less efficient O(n)).

Jeffrey L Whitledge 2009-12-03 18:56:49
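The point about a fixed character set can be sketched as follows. The helper remove_chars is hypothetical: it makes one pass over the string and, for each character, searches a fixed set with strchr, so the total work is proportional to n times the size of the set.

```c
#include <string.h>

/* Hypothetical helper: removes from `s`, in place, every character that
 * appears in `set`. One pass over `s`; each character costs one strchr()
 * over `set`, so the work is O(n * k) with k = strlen(set). */
void remove_chars(char *s, const char *set)
{
    char *dst = s;

    for (; *s != '\0'; ++s) {
        if (strchr(set, *s) == NULL)
            *dst++ = *s;   /* keep characters not found in the set */
    }
    *dst = '\0';
}
```

For a set that does not grow with the input, k is a constant and the function stays linear in the length of the string. The quadratic behavior would come from a different shape of solution, for example repeatedly searching the whole string for each character to remove and shifting the tail each time.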

@Jeffrey - you're right. I was thinking of O(n^2) as exponential.

Ferruccio 2009-12-03 19:59:57

I seem to recall seeing weird behavior if you call tolower(c) and isupper(c) returns false. So I usually shield to*() calls with is*() first.

asveikau 2009-12-03 21:18:44

@asveikau - you may be thinking of _tolower(), which requires that it is passed an upper-case character. tolower() is supposed to check for that.

Ferruccio 2009-12-03 21:49:56

@Ferruccio Nope, I'm pretty sure this was tolower() (no underscore) with glibc. I wasn't aware that an underscored version exists.

asveikau 2009-12-03 23:17:52

Answer 4

A:

Here's a rough cut of an answer for you:

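#include <ctype.h>
#include <string.h>

void strip_punct(char *str) {
    size_t i;
    size_t p = 0;                 /* write index */
    size_t len = strlen(str);
    for (i = 0; i < len; i++) {
        if (!ispunct((unsigned char)str[i])) {
            str[p] = (char)tolower((unsigned char)str[i]);
            p++;
        }
    }
    str[p] = '\0';                /* terminate the shortened string */
}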

Chris R 2009-12-03 18:15:30

See *Shlemiel the painter's algorithm*: http://www.joelonsoftware.com/articles/fog0000000319.html

Sinan Ünür 2009-12-03 18:20:23

The argument to the `is*` and `to*` function should be cast to `unsigned char`.

pmg 2009-12-03 18:22:25

Shlemiel does not apply here: the `strlen()` function is used once outside the loop.

pmg 2009-12-03 18:26:40

@pmg It is not strictly Shlemiel but the function does traverse the string twice: Once to find the length and then to transform.

Sinan Ünür 2009-12-03 18:51:13
