tags:

views:

1555

answers:

16

I need help of a real C guru to analyze a crash in my code. Not for fixing the crash; I can easily fix it, but before doing so I'd like to understand how this crash is even possible, as it seems totally impossible to me.

This crash only happens on a customer machine and I cannot reproduce it locally (so I cannot step through the code using a debugger), as I cannot obtain a copy of this user's database. My company also won't allow me to just change a few lines in the code and make a custom build for this customer (so I cannot add some printf lines and have him run the code again) and of course the customer has a build without debug symbols. In other words, my debbuging abilities are very limited. Nonetheless I could nail down the crash and get some debugging information. However when I look at that information and then at the code I cannot understand how the program flow could ever reach the line in question. The code should have crashed long before getting to that line. I'm totally lost here.

Let's start with the relevant code. It's very little code:

// ... code above skipped, not relevant ...

if (data == NULL) return -1;

information = parseData(data);

if (information == NULL) return -1;

/* Check if name has been correctly \0 terminated */
if (information->kind.name->data[information->kind.name->length] != '\0') {
    freeParsedData(information);
    return -1;
}

/* Copy the name */
realLength = information->kind.name->length + 1;
*result = malloc(realLength);
if (*result == NULL) {
    freeParsedData(information);
    return -1;
}
strlcpy(*result, (char *)information->kind.name->data, realLength);

// ... code below skipped, not relevant ...

That's already it. It crashes in strlcpy. I can tell you even how strlcpy is really called at runtime. strlcpy is actually called with the following paramaters:

strlcpy ( 0x341000, 0x0, 0x1 );

Knowing this it is rather obvious why strlcpy crashes. It tries to read one character from a NULL pointer and that will of course crash. And since the last parameter has a value of 1, the original length must have been 0. My code clearly has a bug here, it fails to check for the name data being NULL. I can fix this, no problem.

My question is:
How can this code ever get to the strlcpy in the first place?
Why does this code not crash at the if-statement?

I tried it locally on my machine:

int main (
    int argc,
    char ** argv
) {
    char * nullString = malloc(10);
    free(nullString);
    nullString = NULL;

    if (nullString[0] != '\0') {
     printf("Not terminated\n");
     exit(1);
    }
    printf("Can get past the if-clause\n");

    char xxx[10];
    strlcpy(xxx, nullString, 1);
    return 0; 
}

This code never gets passed the if statement. It crashes in the if statement and that is definitely expected.

So can anyone think of any reason why the first code can get passed that if-statement without crashing if name->data is really NULL? This is totally mysterious to me. It doesn't seem deterministic.

Important extra information:
The code between the two comments is really complete, nothing has been left out. Further the application is single threaded, so there is no other thread that could unexpectedly alter any memory in the background. The platform where this happens is a PPC CPU (a G4, in case that could play any role). And in case someone wonders about "kind.", this is because "information" contains a "union" named "kind" and name is a struct again (kind is a union, every possible union value is a different type of struct); but this all shouldn't really matter here.

I'm grateful for any idea here. I'm even more grateful if it's not just a theory, but if there is a way I can verify that this theory really holds true for the customer.

Solution

I accepted the right answer already, but just in case anyone finds this question on Google, here's what really happened:

The pointers were pointing to memory, that has already been freed. Freeing memory won't make it all zero or cause the process to give it back to the system at once. So even though the memory has been erroneously freed, it was containing the correct values. The pointer in question is not NULL at the time the "if check" is performed.

After that check I allocate some new memory, calling malloc. Not sure what exactly malloc does here, but every call to malloc or free can have far-reaching consequences to all dynamic memory of the virtual address space of a process. After the malloc call, the pointer is in fact NULL. Somehow malloc (or some system call malloc uses) zeros the already freed memory where the pointer itself is located (not the data it points to, the pointer itself is in dynamic memory). Zeroing that memory, the pointer now has a value of 0x0, which is equal to NULL on my system and when strlcpy is called, it will of course crash.

So the real bug causing this strange behavior was at a completely different location in my code. Never forget: Freed memory keeps it values, but it is beyond your control for how long. To check if your app has a memory bug of accessing already freed memory, just make sure the freed memory is always zeroed before it is freed. In OS X you can do this by setting an environment variable at runtime (no need to recompile anything). Of course this slows down the program quite a bit, but you will catch those bugs much earlier.

+3  A: 

You may be experiencing stack corruption. The line of code you are refering to may not be being executed at all.

Blank Xavier
Hmmm... this is a possibility. Hard to check without being able to step through the code. What makes things a bit more complicated is that the whole function containing the code is inlined into another one, not because I use the inline attribute, but because GCC thinks it is a good idea to do so when optimization flags are used.
Mecki
+4  A: 

The effect of dereferencing the null pointer is undefined by standard as far as I know.

According to C Standard 6.5.3.2/4:

If an invalid value has been assigned to the pointer, the behavior of the unary * operator is undefined.

So there could be crash or could be not.

Kirill V. Lyadvinsky
Does this only hold true for C as a language or does this also hold true for UNIX according to POSIX standard? Since the most common behavior you see in UNIX when dereferencing a NULL pointer is that the application gets a signal which ususally terminates it. At least that's what I would expect as developer.
Mecki
Yes, this holds true for C as a language. Undefined means anything could happen. Depending on or expecting any particular thing to happen is a bad idea -- because it might not do what you expect, even across runs of the same program on the same platform.
Nick Meyer
The most common behaviour, in your customer's case, doesn't mean anything. Which is why you have to stick to the standard instead of implementation-specific quirks.
Michael Foukarakis
+2  A: 

The act of dereferencing a NULL pointer is undefined by the standard. It is not guaranteed to crash and often times won't unless you actually try and write to the memory.

JaredPar
Thought about this as well, but first strlcpy will not try to write to the NULL pointer either (it just reads from it, I have the right source of it here) and second this is in OS X, where a NULL pointer is just a pointer to first memory page of a process and this page has neither read nor write permission, so also when you try to read from it the program gets a SIGBUS signal.
Mecki
A: 

Wow, thats strange. One thing does look slightly suspicious to me, though it may not contribute:

What would happen if information and data were good pointers (non null), but information.kind.name was null. You don't dereference this pointer until the strlcpy line, so if it was null, it might not crash until then. Of course, earlier than t hat you do dereference data[1] to set it to \0, which should also crash, but due to whatever fluke, your program may just happen to have write access to 0x01 but not 0x00.

Also, I see you use information->name.length in one place but information->kind.name.length in another, not sure if thats a typo or if thats desired.

bdk
Sorry, this once is a typo. Fixing it. I changed the names somewhat to more obvious names (otherwise this small code excerpt is unreadable) :-)
Mecki
And regarding your answer: I do dereference it before strlcpy. x[0] is the same as (*x) and x[1] is the same as *(x + 1) and so on. Every array access, read or write access, dereferences the pointer. I don't write to it, though, I only try to read from it (but strlcpy also only tries to read from it; I have the real strlcpy code of the correct libC here and verified that).
Mecki
+12  A: 

First, dereferencing a null pointer is undefined behavior. It can crash, not crash, or set your wallpaper to a picture of SpongeBob Squarepants.

That said, dereferencing a null pointer will usually result in a crash. So your problem is probably memory corruption-related, e.g. from writing past the end of one of your strings. This can cause a delayed-effect crash. I'm particularly suspicious because it's highly unlikely that malloc(1) will fail unless your program is butting up against the end of its available virtual memory, and you would probably notice if that were the case.

Edit: OP pointed out that it isn't result that is null but information->kind.name->data. Here's a potential issue then:

There is no check for whether information->kind.name->data is null. The only check on that is

if (information->kind.name->data[information->kind.name->length] != '\0') {

Let's assume that information->kind.name->data is null, but information->kind.name->length is, say, 100. Then this statement is equivalent to:

if (*(information->kind.name->data + 100) != '\0') {

Which does not dereference NULL but rather dereferences address 100. If this does not crash, and address 100 happens to contain 0, then this test will pass.

Tyler McHenry
When my OS sets my wallpaper to a Sponge Bob image if I deference a NULL pointer, I'll turn it back :-P BTW malloc is not failing here, *result is not NULL, it is 0x341000 in the process address space. It is the name->data that is NULL and this is okay, as the data I parsed may not contain a name... that it should have a name according to customer and there might be an error in the parse function as well is another issue I guess.
Mecki
Thanks for the clarification, see edit.
Tyler McHenry
But, as the OP points out, information->kind.name->length must be zero because he knows that realLength is 1.
Martin B
+1  A: 

My theory is that information->kind.name->length is a very large value so that information->kind.name->data[information->kind.name->length] is actually referring to a valid memory address.

Bombe
Another idea that came to my mind. If length is > 4095 I access memory after the first memory page (the NULL guard page with no read/write permission that usually cause the signal on access) and this will indeed not crash. What speaks against that is that realLength is 1 after adding 1 to the length value, so the length value must have been zero before.
Mecki
A: 

Despite the fact that dereferencing a null pointer leads to undefined behaviour and not necessarily to a crash, you should check the value of information->kind.name->data and not the contents of information->kind.name->data[1].

Michael Foukarakis
+1  A: 

As an FYI, when I see this line:

if (information->kind.name->data[information->kind.name->length] != '\0') {

I see up to three different pointer dereferences:

  1. information
  2. name
  3. data (if it's a pointer and not a fixed array)

You check information for non-null, but not name and not data. What makes you so sure that they're correct?

I also echo other sentiments here about something else possibly damaging your heap earlier. If you're running on windows, consider using gflags to do things like page allocation, which can be used to detect if you or someone else is writing past the end of a buffer and stepping on your heap.

Saw that you're on a Mac - ignore the gflags comment - it might help someone else who reads this. If you're running on something earlier than OS X, there are a number of handy Macsbugs tools to stress the heap (like the heap scramble command, 'hs').

plinth
Yes, you are definitely correct, none of those are checked. And to answer your question: I have no idea if they are correct. They might all be pointing into nowhere. But what I know is that the second parameter to strlcpy is NULL, thus data points to NULL and nothing in the program could have changed that, so it must have been NULL in the if statement as well... but there I dereferenced it and nothing went wrong (data[0] is the same as (*data), so it means dereferncing it, doesn’t it?).
Mecki
+7  A: 

It is possible that the structure is located in memory that has been free()'d, or the heap is corrupted. In that case, malloc() could be modifying the memory, thinking that it is free.

You might try running your program under a memory checker. One memory checker that supports Mac OS X is valgrind, although it supports Mac OS X only on Intel, not on PowerPC.

mark4o
Wow, brilliant reply! Haven't even thought about that one. Yes, indeed, the data pointer might point to something useful at the if statement, but may itself be located in already free'd (not yet re-used) memory. Calling malloc may cause changes to that free memory so that now the data pointer points to NULL all of a sudden. This is definitely possible!
Mecki
So far this reply looks most promising; something so simple and still I never thought about this possibility. Of course malloc may change any memory used in my code if I accidentally free'd it. I can test this with the customer. You can force OS X to load an alternative malloc implemention (by setting an environment variable) which makes sure accessing free'd memory crashes immediately (for debugging purposes). Let's see what that will tell us :-)
Mecki
You, sir, are a genius :-) I have not considered that calling malloc can do things like changing the VM mapping of the process, zero out pages and so on. I can now reproduce the issue locally and indeed, when I skip over that malloc call, it won't crash. So accepted reply, congratulations!
Mecki
If you can run your program on an Intel Mac then I would still recommend running it under valgrind's memcheck tool, to look for other possible heap corruption, buffer overflows, use of uninitialized values, memory leaks, etc. If you have a test suite that can be periodically run under valgrind then that is even better.
mark4o
A: 
char * p = NULL;

p[i] is like

p += i;

which is a valid operation, even on a nullpointer. it then points at memory location 0x0000[...]i

StampedeXV
Yes, but "p[i] == 0" is like "*(p + i) == 0" and that is not possible if p is NULL. Okay, not quite right, it is possible if i is larger than a page size on most systems (as then you are not accessing the NULL guard on most OSes and thus not causing a crash), but in my case realLength is 1 after adding 1 to length and thus length must have been 0 before. And in case i is zero above we have "*(p + 0) == 0" and this will crash if p is NULL, give it a try. Also "if (p[0] == 0)" will crash if p is NULL, try it.
Mecki
did you check what NULL is? NULL usually is a #define in C. It could also be 0xFFFFFFFFF or someting arbitrary. :)
StampedeXV
On Mac OS X NULL is indeed defined as "(void *)0" - but you are right, C doesn't say it NULL must be 0, it can be any value. NULL is just NULL and you shouldn't care as developer what that means.
Mecki
+1  A: 

I'm interested in the char* cast in the call to strlcpy.

Could the type data* be different in size than the char* on your system? If char pointers are smaller you could get a subset of the data pointer which could be NULL.

Example:

int a = 0xffff0000;
short b = (short) a; //b could be 0 if lower bits are used

Edit: Spelling mistakes corrected.

Key
Thought about this cast, too... in theory that could have caused the issue, but in practice I found the issue (see update of question) and that was not it - on my system the pointers are all equal in size. Still, you get an upvote because you are the first one who thought about the cast being a possible cause of all this; if a system has really different pointer sizes for different data types (which is perfectly valid I think), this could cause such an issue.
Mecki
+1  A: 

Here's one specific way you can get past the 'data' pointer being NULL in

if (information->kind.name->data[information->kind.name->length] != '\0') {

Say information->kind.name->length is large. Atleast larger than 4096, on a particular platform with a particular compiler (Say, most *nixes with a stock gcc compiler) the code will result in a memory read of "address of kind.name->data + information->kind.name->length].

At a lower level, that read is "read memory at address (0 + 8653)" (or whatever the length was). It's common on *nixes to mark the first page in the address space as "not accessible", meaning dereferencing a NULL pointer that reads memory address 0 to 4096 will result in a hardware trap being propagated to the application and crash it.

Reading past that first page, you might happen to poke into valid mapped memory, e.g. a shared library or something else that happened to be mapped there - and the memory access will not fail. And that's ok. Dereferencing a NULL pointer is undefined behavior, nothing requires it to fail.

leeeroy
Yes, this is perfectly right (you know your UNIX VM, don't you? ;-)); if length is 4096 or bigger it will indeed not crash. Only problem is, we can say for sure length is 0, otherwise realLength was not 1 when the crash happens (and real length is 1 when the crash happens). Still, good idea. You get an upvote for pointing this out.
Mecki
A: 

You should always check whether information->kind.name->data is null anyway, but in this case

in

if (*result == NULL) 
    freeParsedData(information);
    return -1;
}

you have missed a {

it should be

if (*result == NULL)
{ 
     freeParsedData(information);
     return -1;
}

This is a good reason for this coding style, instead of

if (*result == NULL) { 
    freeParsedData(information);
    return -1;
}

where you might not spot the missing brace because you are used to the shape of the code block without the brace separating it from the if clause.

Captain Lepton
Sorry, I missed that one, it is there in the real source code.
Mecki
Great, but in any case I strongly recomment that coding style as it lines up the start and end braces horizontally, making it easier to find corresponding braces and generally easier to see the code structure.You can see that one of the most fundamental jobs when programming with dynamic memory in C is to keep track of allocating and freeing memory and making sure you have covered every case. It is standard and sensible to always set the pointer to NULL after a free in order to use the pointer itself as a flag indicating non-allocated memory.
Captain Lepton
+1  A: 

Missing '{' after last if statement means that something in the "// ... code above skipped, not relevant ..." section is controlling access to that entire fragment of code. Out of all the code pasted only the strlcpy is executed. Solution: never use if statements without curly brackets to clarify control.

Consider this...

if(false)
{
    if(something == stuff)
    {
     doStuff();

    .. snip ..

    if(monkey == blah)
     some->garbage= nothing;
     return -1;
    }
}
crash();

Only "crash();" gets executed.

Kaloth
Of course, the compiler would catch the mismatched {}, but you never know... ;)
Pod
Yes, but that was just because I didn't copy the code directly, instead I typed it and forget to add it. It is there in the real source file :-P I just did that as the names in the real source file are very cryptic (to not say ugly) and using more obvious names makes it easier to read.
Mecki
A: 

*result = malloc(realLength); // ???

Address of newly allocated memory segment is stored at the location referenced by the address contained in the variable "result".

Is this the intent? If so, the strlcpy may need modification.

Mecki
+1  A: 

I would run your program under valgrind. You already know there's a problem with NULL pointers, so profile that code.

The advantage that valgrind beings here is that it checks every single pointer reference and checks to see if that memory location has been previously declared, and it will tell you the line number, structure, and anything else you care to know about memory.

As every one else mentioned, referencing the 0 memory location is a "que sera, sera" kinda thing.

My C tinged spidey sense is telling me that you should break out those structure walks on the

if (information->kind.name->data[information->kind.name->length] != '\0') {

line like

    if (information == NULL) {
      return -1; 
    }
    if (information->kind == NULL) {
      return -1; 
    }

and so on.

Shawn Leslie
You are right; checking every single pointer against NULL before using it is the correct way of doing it. Okay, in this case it hadn't solved anything (see accepted solution and my update to the question), but even though I found the reason for all this, I change the code exactly to something that looks like in your example :-)
Mecki