views:

441

answers:

5
char *s = "hello ppl.";
for (i = 0; i < strlen(s); i++) {
    char c = s[i];
    if (c >= 97 && c <= 122) {
        c += 2;
        s[i] = c;
    }
}

I want to rotate the string by two characters: "hello ppl." -> "jgnnq rrn."

I am getting a segmentation fault. What is wrong with the code?

+31  A: 

The code:

 char *s = "hello ppl.";

gives you a pointer to what it almost certainly read-only memory. That's because string constants in C are const char *. When you try to write to that memory, you'll most likely get a segmentation violation. Try this instead:

char s[] = "hello ppl.";

which is conceptually the same as:

char s[11];               // for the whole string plus null terminator
strcpy (s, "hello ppl.");

In other words, it puts the string you want to change into writable memory. The following code:

#include <stdio.h>
#include <string.h>
int main(void) {
    int i;
    char s[] = "hello ppl.";
    for (i = 0; i < strlen(s); i++) {
        char c = s[i];
        if (c >= 97 && c <= 122) {
            c += 2;
            s[i] = c;
        }
    }
    printf("%s\n",s);
    return 0;
}

gives you "jgnnq rrn.".

A few other things I'd like to point out which are not fatal:

  • It's not usually a good idea to use 'magic' numbers like 97 and 122. It's just as easy, and clearer in intent, to use 'a' and 'z'.
  • If you really want to rotate, you can't blindly add 2 to 'y' and 'z'. You have to treat them specially (subtract 24) so that they map correctly to 'a' and 'b'.
  • The C standard doesn't guarantee that alpha characters are contiguous. If you know you're using ASCII, you're probably okay but I thought I'd just mention that. As an aside, it does guarantee that for the numeric characters.

Having said that, I'd rather use a mapping table as follows:

#include <stdio.h>
#include <string.h>
int main (void) {
    char *lkupPtr, *strPtr;
    char str[] = "hello ppl.";
    const char * const from = "abcdefghijklmnopqrstuvwzyz";
    const char * const to   = "cdefghijklmnopqrstuvwzyzab";

    for (strPtr = str; *strPtr != '\0'; strPtr++)
        if (lkupPtr = strchr (from, *strPtr)) != NULL)
            *strPtr = to[(int)(lkupPtr - from)];

    printf("%s\n",str);
    return 0;
}

This takes care of all the points I raised above and you can, if necessary, add more mappings if you're in an internationalized environment (rather than just plain ASCII or EDCDIC).

This should be, in my opinion, fast enough for all but the most demanding of requirements (I clocked it at over 3 million characters per second on my PC). If you have a near-insatiable need for performance over and above that, yet don't want to opt for hand-crafted assembly targeted to your specific CPU, you could try something like the following.

It's still fully compliant with the C standard but may deliver better performance by virtue of the fact all heavy calculation work is done once at the start. It creates a table holding all possible character values, initializes it so that every character translates to itself by default, then changes the specific characters you're interested in.

That removes any checking for characters from the translation itself.

#include <stdio.h>
#include <string.h>
#include <limits.h>

static char table[CHAR_MAX + 1];
static void xlatInit (void) {
    int i;
    char * from = "abcdefghijklmnopqrstuvwzyz";
    char * to   = "cdefghijklmnopqrstuvwzyzab";
    for (i = 0; i <= CHAR_MAX; i++) table[i] = i;
    while (*from != '\0') table[*from++] = *to++;
}

int main (void) {
    char *strPtr;
    char str[] = "hello ppl.";

    xlatInit(); // Do this once only, amortize the cost.

    for (strPtr = str; *strPtr != '\0'; strPtr++)
        *strPtr = table[*strPtr];

    printf("%s\n",str);
    return 0;
}
paxdiablo
You might want to include string.h , since you are giving an example that can be pasted and compiled.
Tim Post
And it would be better if the code used 'islower()' from '`<ctype.h>`' - for all the original was written as shown.
Jonathan Leffler
Good point, @Tim. Done. @JL, another good point but not necessary for my preferred from/to solution.
paxdiablo
C and later C++ optimizing compilers have been replacing for(i=0; i<n; i++) foo(a[i]); with for(sp=a; sp<a+n; sp++) foo(sp); for at least 20 years, so no performance is gained by doing so manually unless you have a really terrible compiler. I do think it is more readable, though, since it is more in C string processing style.
Ray Burns
If you're a performance junkie, strchr is not your friend, at least not the way it is used here. A simple *sp = (*sp<'a' || *sp>'z') ? *sp : (*sp-'a' + 2)%26 + 'a'; will do the trick nicely and run about 10x faster. (and yes, the compiler will optimize the multiple *sp usage into a single memory read)
Ray Burns
@Ray, it wasn't the switch from arrays to pointers that was important, it was the move away from evaluating strlen every time through the loop. You could of course do that once outside the loop but, since that required a change anyway, I opted to go further, to save variables as much as anything :-). And did you not read my rationale for the lookup? The C standard does *not* guarantee letters are contiguous code points. The solution I gave will work for all character sets. By all means use your solution, or a slightly more readable one :-), if you know they're contiguous.
paxdiablo
And, on top of that, we don't *know* how strchr is implemented. Maybe we have such a super-smart compiler, it knows the underlying character set and optimizes strchr to your very method. Yes, I know that's a long shot but it *is* true that libc code is usually far more optimized than anything you can write in C itself - it often uses assembly instructions that can't be used by a compiler needing to handle generalized code.
paxdiablo
@paxdiablo thanks for the brilliant discussion. I have a lot to learn :)
Amoeba
@paxdiablo: Your code is much slower than necessary. On an Intel Core 2 Duo T7400 VC++ 9, the original solution runs in 1.99ns. Your solution takes 3.32ns but handles the case of non-contiguous charsets. My solution below also handles non-contiguous charsets but takes on 2.38ns, including the initialization code (without the init code it takes only 0.35ns). I think it was your claim to be a "performance junkie" that set me off. I just think you should show the fastest way to do it.
Ray Burns
@Ray. I don't intend arguing the point. I've already stated that, if you make assumptions re the environment, you can speed it up. I prefer to leave my code as-is since it's implementation-neutral. One of the guidelines I follow is "get it working first, then get it working fast".
paxdiablo
I'd only worry about performance, other than obvious things like multiple calls to strlen on an unchanging string, of course :-), if its a real problem. By your measurements (3.32ns/10 chars), this one is still able to crank out three billion characters per second. If you need more speed, you may have to sacrifice portability, not in itself a bad thing, but I'd only do it if necessary.
paxdiablo
You have written `const char const *` which is valid C99 code, but i think the double const is invalid in C89 and at the very least it's redundant :)
Johannes Schaub - litb
The code I wrote below is 850% faster, does not sacrifice portability, makes no assumptions about the environment, and handles all charsets including EBCDIC and 9 bit char types. It will run on any machine with a C compiler. No speed/portability tradeoff is necessary.
Ray Burns
My code handles 30 billion characters per second. Yours handles 3 billion. Both are simple and completely platform neutral. Which is better?
Ray Burns
@litb, I don't think it's redundant (but I've been wrong before), they're two separate things. One means the *pointer* is constant, the other means the data is points to is constant. This *should* allow for insane levels of optimization (I would think). Whether it's valid for c89 I don't know, I generally use gcc in c99 mode.
paxdiablo
@Ray, "better" is a subjective term, I prefer measurable terms like "faster". And yes, your code is now faster but, until you made the change to your array size, it was *not* portable (hence the comments). As to whether there is actually a difference in handling 3 or 30 billion chars per second, that's for whoever is using it. In my opinion, unless you're actually having to process really huge quantities (many, *many* billions), it won't make a difference. But, as I said, if you need that speed, then optimize - and that would include bypassing C altogether and hand-crafting assembler.
paxdiablo
@paxdiablo, well i think you meant `const char * const` then :) If you put const before the star, then the order where it appears is arbitrary, (i usually prefer `char const *const` for instance), so one of the two consts you have there is redundant :)
Johannes Schaub - litb
Thanks, @litb, changed it to the correct form, const char * const.
paxdiablo
+5  A: 

In char *s = "hello ppl." you are not allocating any memory, instead you are pointing to a string which may be residing in read-only memory of your program. Ideally it should be const char*. Now if you try to modify it, it will crash.

Naveen
String literals are typed as const char* in the read-only static string table when compiled. Good call on this @Naveen: the type information of string literals is lost on many people.
rpj
+5  A: 

The code:

char *s = "hello ppl.";

creates an entry in the string table, typically in the code segment (read only space of the program). Any attempts to change it will cause the segfault by trying to modify read only memory. The appropriate way to create/initialize a string to be modified is:

char s[] = "hello ppl.";
Matt
+6  A: 

The variable s points to read-only memory. This means that it cannot be modified. You'll want to use:

char varname[] = "...";

Gotchas to watch out for:

char varname[] = "...";

Places the data on the stack. Make sure you are not returning a pointer to a functions local data. If that's the case you'll want to look at malloc to allocate memory in the heap.

Another gotcha:

for (i = 0; i < strlen(s); i++) {...} is O(N^2)

The reason is that strlen(s) is an O(N) operation you do each time through the loop. An improvement would be:

int len = strlen(s);
for(i=0;i<len;i++) { ... }

This way we only do the strlen(s) computation once and reuse the result.

Evan
+1 for comment about strlen() in the loop. I'm not wholly convinced about the term static memory; readonly memory might be better.
Jonathan Leffler
GCC optimises the `strlen` out with `-O2`. I guess the result is known at compile time because the `s` is known at the time `strlen` is called, and it points to readonly memory.
dreamlax
That's ... brave ... of the compiler. It deosn't point to read-only memory at all (char x[] = "ff"; puts it on the stack). I'd be interested as to how gcc figures this out. Or did you mean strlen in the original where it was read-only?
paxdiablo
@paxdiablo: I mean the original posted by the OP in his question.
dreamlax
+3  A: 

As others have mentioned,

char *s = "hello ppl.";

points to read-only memory because it is a string literal. It should be

char s[] = "hello ppl.";

which creates an array in read-write memory and copies the string into it.

Ignoring non-ASCII character sets, the problem can be solved most efficiently like this:

void Convert(char *s)
{
  for(char *sp = s; *sp; sp++)
    if(*sp >= 'a' && *sp <= 'z')
      *sp = (*sp - 'a' + 2) % 26 + 'a';
}

If you're dealing with EBCDIC or any other charset that doesn't have contiguous alphabetic characters, you can use a map:

char *from = "abcdefghijklmnopqrstuvwxyz";
char *to   = "cdefghijklmnopqrstuvwxyzab";
char map[CHAR_MAX+1];

void Initialize()
{
  for(int i=0; from[i]; i++)
    map[from[i]] = to[i];
}

void Convert(char *s)
{
  for(char *sp = s; *sp; sp++)
    if(map[*sp])
      *sp = map[*sp];
}

The compiler will optimize each of these to nearly optimal assembly language.

Update In the original problem there was no separate Initialize() call, so I optimized the code to make "Initialize(); Convert(s);" as fast as possible. If you are able to call Initialize() ahead of time and only care about how fast "Convert(s);" runs, the optimal code will fill the array first, like this:

char *from = "abcdefghijklmnopqrstuvwxyz";
char *to   = "cdefghijklmnopqrstuvwxyzab";
char map[CHAR_MAX+1];

void Initialize()
{
  int i;
  for(i=0; i<=CHAR_MAX; i++)  // New code here fills the array
    map[i] = i;
  for(i=0; from[i]; i++)
    map[from[i]] = to[i];
}

void Convert(char *s)
{
  for(char *sp = s; *sp; sp++)  // 'if' removed
    *sp = map[*sp];
}

This modified code is 375% slower if you are calling "Initialize(); Convert(s);", but it is 3% faster if you have already called Initialize() and you are only timing "Convert(s);".

Ray Burns
Other than the problem that a character set may have more than 256 characters, that's not bad. But why wouldn't you change Initialize to first set map[i] = i for i= 0..255, then change the mappings for the specific characters in from and to? That way, you're lookup code can ditch the 'if' altogether.
paxdiablo
@paxdiablo: This solution is already much faster than yours (2.38ns vs 3.32ns on an Intel Core Duo T7400 VC++ 9 using the given input data). Unfortunately your suggestion would slow it down, not speed it up. I tried code filling the whole map in advance and omitting the 'if', and it turned out to MUCH slower at 11.31ns, almost 5x as slow. That's why I used the 'if'. Of course if the string were much longer or the initialization were a separate step, the complete map would be faster (I timed it at 0.35ns with the initialization step done in advance).
Ray Burns
Fixed the hard-coded 256 to make it portable even to machines with 9 bit and 16 bit char types
Ray Burns
Johannes Schaub - litb
@Ray, well you say you added the `if` *because* your code would be slower with the complete map filled *because* you call `Initialize` each time. But in your code you don't do that, as far as i can see. Have you missed putting `Initialize();` somewhere?
Johannes Schaub - litb
@litb: Good question. It was because I was optimizing "Initialize(); Convert(s);" instead of just "Convert(s);". The original problem didn't allow a separate Initialize() call, so I really only put the initialization code in a separate function for readability and to suggest it could be called separately. I have added an update to my answer that explains this further and gives the modified code and related statistics.
Ray Burns
The separate call to Initialize() is a good idea - it amortizes the cost of setting up the translation table across all calls to Convert(). Not bad. +1.
paxdiablo
My $0.02: You could make a global variable, `static int initialized = 0;` and in `Initialize()` check `if(initialized) return;` and add `initialized = 1;` at the end. Alternatively (or additionally), we could add `if(!initialized) { Initialize(); initialized = 1; }` to `Convert()`. Basically, the setup may lose a tiny amount of speed, but we never _have_ to call `Initialize()` separately - it'll be called for us if we forget to. Of course, this is a bit un-C-like, but it's still nice.
Chris Lutz