ansaurus

Question

What is the safest way to pass strings around in C?

Answer 1

+4 A:

You can't return a pointer to an array that's allocated locally within the function. As soon as the function returns, that array is going to be clobbered.

Also, when you put

char localstr[MAX_STRLENGTH] = strcpy(localstr, somestr);

what happens is that strcpy() will copy the bytes into the localstr[] array, but then you have an unnecessary assignment thing going on. You could probably get the intended effect as two lines, thus ..

char localstr[MAX_STRLENGTH];
strcpy(localstr, somestr);

Also, it's bad form to embed a free() call inside a function like this. Ideally the free() should be visible at the same level of scope where the malloc() occurred. By the same logic it's a little dubious to allocate memory down in a function this way.

If you want a function to modify a string, a common convention goes something like so

// use a prototype like this to use the same buffer for both input and output
int modifyMyString(char buffer[], int bufferSize) {
    // .. operate you find in buffer[],
    //    leaving the result in buffer[]
    //    and be sure not to exceed buffer length
    // depending how it went, return EXIT_FAILURE or maybe
    return EXIT_SUCCESS;

// or separate input and outputs
int workOnString(char inBuffer[], int inBufSize, char outBuffer[], int outBufSize) {
    // (notice, you could replace inBuffer with const char *)
    // leave result int outBuffer[], return pass fail status
    return EXIT_SUCCESS;

Not embedding malloc() or free() inside will also help avoid memory leaks.

JustJeff 2010-04-15 00:08:00

I make a copy and point it to returnstr though...doesn't that avoid the local scope problem?

chucknelson 2010-04-15 00:10:29

@chucknelson: it might be a typo in the example, but what's returned is a pointer to the local array, not the newly allocated block that's pointed to in `returnstr`.

Michael Burr 2010-04-15 00:12:59

@MichaelBurr ack, you're right, fixing now...thanks!

chucknelson 2010-04-15 00:14:15

If only it could be guaranteed that the array would be clobbered once it goes out of scope ... the fact that it *might* get clobbered depending on how many other functions you call and it's relative position on the stack is what makes this such a hair-tearing experience.

Duncan 2010-04-15 00:35:34

@Duncan - yep, it's a great way to get an intermittent failure. also, one other stack wrecker is a context save due to an interrupt.

JustJeff 2010-04-15 00:46:06

Quick question on returning EXIT_FAILURE or EXIT_SUCCESS...should I have multiple returns, jumping out with EXIT_FAILURE if anywhere in a function I hit some error condition? I have one function with 4 return statements in this fashion, that doesn't seem very straightforward...it is readable though...?

chucknelson 2010-04-15 02:10:57

@chucknelson, yes. As soon as you hit an error from which you cannot recover without a loss of precision/reliability, you should return with an error status code. Such a thing is readable.

Michael Aaron Safyan 2010-04-16 01:38:48

Answer 2

+10 A:

Well-written C code adheres to the following convention:

All functions return a status code of type int, where a return value of 0 indicates success, and a -1 indicates failure. On failure, the function should set errno with an appropriate value (e.g. EINVAL).
Values that are "reported" by a function should be reported via the use of "out parameters". In other words, one of the parameters should be a pointer to the destination object.
Ownership of pointers should belong to the invoker; consequently, a function should not free any of its parameters, and should only free objects that it, itself, allocates with malloc/calloc.
Strings should be passed either as const char* objects or as char* objects, depending on whether the string is to be overwritten. If the string is not to be modified, then const char* should be used.
Whenever an array is passed that is not a NUL-terminated string, a parameter should be provided indicating the the number of elements in the array or the capacity of that array.
When a modifiable string/buffer (i.e. char*) object is passed into a function, and that function is to overwrite, append, or otherwise modify the string, a parameter indicating the capacity of the string/buffer needs to be provided (so as to allow for dynamic buffer sizes and to avoid bufffer overflow).

I should point out that in your example code, you are returning localstr and not returnstr. Consequently, you are returning an address of an object in the current function's stack frame. The current function's stack frame will disappear once the function has returned. Invoking another function immediately afterwards will likely alter the data in that location, leading to the corruption that you have observed. Returning the address of a local variable leads to "undefined behavior" and is incorrect.

Edit
Based on your updated code (get_fullpath), it is clear that the problem is not in your function get_fullpath, but rather in the function that is calling it. Most likely, the paths variable is being supplied by a function that returns the address of a local variable. Consequently, when you create a local variable within get_fullpath, it is using the same exact location on the stack that paths previously occupied. Since "paths" is aliasing "fullpaths", it is basically overwritten with the address of the buffer that you've malloced, which is blank.

Edit 2
I have created a C Coding Conventions page on my website with more detailed recommendations, explanations, and examples for writing C code, in case you are interested. Also, the statement that localstr is being returned instead of returnstr is no longer true since the question has last been edited.

Michael Aaron Safyan 2010-04-15 00:12:57

Nice, I like this list. Thanks for the help! I have much to learn about being disciplined...

chucknelson 2010-04-15 00:17:21

+1 for your third point!

JustJeff 2010-04-15 00:18:50

That's where I get lost, how is "fullpaths" even touching memory that is already allocated and being used in "paths"?

chucknelson 2010-04-15 00:27:34

@chucknelson - he's suggesting that the problem is that the memory is no longer allocated. The fact that your first printf() prints what you expect is actually undefined behavior - you happen to be seeing the last thing stored at that location in memory before it gets re-used by the new stack variable 'fullpath'.

Brian Roach 2010-04-15 00:57:47

If a call to `malloc` is changing the memory pointed to by "paths", that likely means that "paths" is pointing to memory that was already freed, or that your heap memory has been corrupted somehow (for example, by writing past the end of a malloc'd block).

David Gelhar 2010-04-15 01:06:57

Nice style guide, but it should be noted that the first point is only one possible error status solution. There are others, each with its pro and cons. And while error indication is now well defined, you've omitted the most important parts of it. 1) If a function returns a status, then you have to check the status after the function call. No exception, no discussion, no excuses. 2) The error handling is missing. So a function returned an error. And then? What to do now? Especially since errno is a global that may be overwritten with the next error.

Secure 2010-04-15 14:04:54

@Michael Aaron Safyan: in your coding conventions site, you have a bad practice in your buffers section. `buffer[resultlen]='\0';` writes a `0` based on the source data's length, not the size of the buffer. I see that you do check `bufferlen<(resultlen+1)` which is good. But it is better to be safe. It should be `buffer[bufferlen - 1] = '\0';`

Evan Teran 2010-04-15 19:21:34

@Secure, @Evan Teran, thank you both for your comments. I have incorporated your feedback.

Michael Aaron Safyan 2010-04-16 01:23:56

Hmm... now I've read it from top to bottom. It just feels wrong and smells on so many levels. There are many "should" without any explanation. WHY should I write my C code to be C++ compilable? I'm writing C, not C++. WHY should my error codes be negative? The only place "assert" is mentioned is to substitute the return status check, ie. the regular program flow, for which assert should NEVER be used, while it would be perfect for most of the EINVAL checks. The _unchecked version are an enormous interface bloat, adding an additional call level to any function, and therefor effectively (cont)

Secure 2010-04-16 07:01:32

(cont) slowing down everything. I would call this preparation for premature optimization. Seriously, if I would be new to C programming and found this style guide, I would be scared by this convoluted, verbose and complex "mess". At least leave everything out that has to do with optimization. Instead include the standard rules for optimization: Don't do it. Don't do it yet. Do it only after measuring for identified bottlenecks. Macro-optimize algorithms first, then the need for micro-optimizations may be gone. Premature optimization is the root of all evil. And all that.

Secure 2010-04-16 07:08:09

@Secure, the reason for returning a negative value is that all the system interfaces of IEEE Std. 1003.1 ("POSIX") return a -1 on failure. Thus testing for less-than-zero is a valid check for failure with those system interfaces, and so it is in keeping with the "principle of least surprise" to be consistent with that.

Michael Aaron Safyan 2010-04-16 12:38:40

@Secure, C code doesn't need to be compilable as C++, although it is good to keep the headers compatible. I could have removed the "EXTERNC" declarations from the C source file... I simply chose to leave them in, because it would have taken time to remove them, and as a bonus, leaving them there allows it to be compiled as C++.

Michael Aaron Safyan 2010-04-16 12:45:08

@Secure, I agree on the optimization bit... I would not have done it, except that it allows me to separate out the checking and the actual doing... otherwise, the logic becomes lost in the error checking. If it were purely for optimization, then I agree it wouldn't make sense.

Michael Aaron Safyan 2010-04-16 12:52:19

@Secure, with regard to the assert... the assert is being used for a check that is known to never fail (e.g. testing for null, and then passing the parameter to a function that only fails if the parameter is null). The assert is merely a documentation of intent.

Michael Aaron Safyan 2010-04-16 12:54:00

But why is it good to keep the headers compatible when I'm writing plain C, not C++? I'm a programmer, the less I have to type the less errors I can make. You say I should do so because it is a good idea. But it is more work and is visual interface bloat. There must be an advantage for this extra work. Just saying "it is good" smells like Cargo Cult. http://en.wikipedia.org/wiki/Cargo_cult_programming

Secure 2010-04-16 12:58:47

For the assert: I've said "no exception, no discussion, no excuses". Always do the checking. Omitting it introduces a coupling between the caller and the implementation of the called function. You are no longer free to change the implementation, because then it may throw new errors and your assumption is no longer valid. You had to ask yourself "Where did I assume this function to never fail?" The assert has then become a part of the program logic. What happens when the new error happens in the release version, with the assert not compiled in, and you didn't catch it in the debug version?

Secure 2010-04-16 13:11:17

@Secure, that is a valid point. If all you are writing is pure C, then it would be overkill. For example, on an embedded system, this would be wasted effort. If one is creating a library, though, which one plans to distribute, then, given the popularity of C++, it should not come as a shock that C++ developers may want to use it. As a C++ developer myself, I am biased, and am encouraging readers to make their headers C++ friendly.

Michael Aaron Safyan 2010-04-16 13:26:20

@Secure, you are completely right about the use of the word "should". Some of my wording was a little bit too black-and-white and did not give proper justification for all the decisions. I have changed some of the wording to explain when/why... for example, I changed the bullet point to say that it should use EXTERNC if one intends for the header to be included by C++ code. Thank you for helpful feedback and keeping me honest!

Michael Aaron Safyan 2010-04-16 13:35:11

Answer 3

A:

Is your "update" example complete? I wouldn't think that would compile: it calls for a return value but you never return anything. You never do anything will fullpath, but perhaps that's deliberate, maybe your point is just to say that when you do the malloc, other things break.

Without seeing the caller, it's impossible to say definitively what is happening here. My guess is that paths is a dynamically-allocated block that was free'd before you called this function. Depending on the compiler implementation, a free'd block could still appear to contain valid data until a future malloc takes over the space.

Update: to actually answer the question

String handling is a well-known problem in C. If you create a fixed-size array to hold the string, you have to worry about a long string overflowing the allocated space. This means constantly checking string sizes on copies, using strncpy and strncat instead of the plain strcpy and strcat, or similar techniques. You can skip this and just say, "Well, no one would ever have a name longer than 60 characters" or some such, but there's always the danger then that someone will. Even on something that should have a known size, like a social security number or an ISBN, someone could make a mistake entering it and hit a key twice, or a malicious user could deliberately enter something long. Etc. Of course this is mostly an issue on data entry or reading files. Once you have a string in a field of some known size, then for any copies or other manipulation, you know the size.

The alternative is to use dynamically-allocated buffers where you can make them as big as needed. This sounds like a good solution when you first hear it, but in practice it's a giantic pain in C, because allocating the buffers and freeing them when you no longer need them is a lot of trouble. Another poster here said that the function that allocates a buffer should be the same one that frees it. A good rule of thumb, I generally agree, but ... What if a subroutine wants to return a string? So it allocates the buffer, returns it, and ... how can it free it? It can't because the whole point is that it wants to return it to the caller. The caller can't allocate the buffer because it doesn't know the size. Also, seemingly simple things like:

if (strcmp(getMeSomeString(),stringIWantToCompareItTo)==0) etc

are impossible. If the getMeSomeString function allocates the string, sure, it can return it so we do the compare, but now we've lost the handle and we can never free it. You end up having to write awkward code like

char* someString=getMeSomeString();
int f=strcmp(someString,stringIWantToCompareItTo);
free(someString);
if (f==0)
etc

So okay, it works, but readability just plummetted.

In practice, I've found that when strings can reasonably be expected to be of a knowable size, I allocate fixed-length buffers. If an input is bigger than the buffer, I either truncate it or give an error message, depending on the context. I only resort to dynamically-allocated buffers when the size is potentially large and unpredictable.

Jay 2010-04-15 01:14:36

Yeah, it was just an example to show when it would act strange. I'm going with the approach of using output parameters, and declaring fixed-sized character arrays to pass to functions, and then I read the result from them.

chucknelson 2010-04-15 02:12:31

ansaurus

tags:

views:

answers:

What is the safest way to pass strings around in C?

related questions