tags:

views:

289

answers:

7

Most high-level languages (Python, Ruby, even Java) use pass-by reference. Obviously, we don't have references in C, but we can imitate them using pointers. There are a few benefits to doing this. For example:

int findChar(char ch, char* in)
{
    int i = 0;
    for(i = 0; in[i] != '\0'; i++)
        if(in[i] == ch)
            return i;
    return -1;
}

This is a common C paradigm: catch an abnormal or erroneous situation by returning some error value (in this case, return -1 if the character is not in the string).

The problem with this is: what if you want to support strings more than 2^31 - 1 characters long? The obvious solution is to return an unsigned int but that won't work with this error value.

The solution is something like this:

unsigned int* findChar(char ch, char* in)
{
    unsigned int i = 0;
    for(i = 0; in[i] != '\0'; i++)
        if(in[i] == ch)
        {
            unsigned int index = (unsigned int*) malloc(sizeof(unsigned int));
            *index = i;
            return index;
        }
    return NULL;
}

There are some obvious optimizations which I didn't make for simplicity's sake, but you get the idea; return NULL as your error value.

If you do this with all your functions, you should also pass your arguments in as pointers, so that you can pass the results of one function to the arguments of another.

Are there any downsides to this approach (besides memory usage) that I'm missing?

EDIT: I'd like to add (if it isn't completely obvious by my question) that I've got some experience in C++, but I'm pretty much a complete beginner at C.

+6  A: 

It is a bad idea because caller is responsible to free the index, otherwise you are leaking memory. Alternatively you can use static int and return its address every time - there will be no leaks, but function becomes non-reentrant, which is risky (but acceptable if you bear it in mind).

Much better would be to return pointer to char function finds, or NULL if it is not present. That's the way strchr() works, BTW.

Edited to reflect changes in original post.

qrdl
That was my intent, it's been corrected.
Imagist
Any kind of utility function like this that was non-reentrant is an accident waiting to happen. I really don't like the 'static int' suggestion!
Roddy
+1 for the rest, though!
Roddy
@Roddy That's why I said it is risky. strerror() (from the top of my mind) isn't reentrant for the very same reason but is still very usable.
qrdl
+1  A: 
  1. The function needs to dereference the parameters, which takes more time than accessing the stack.
  2. The pointers can be uninitialized, causing unexpected results.
  3. There is no standard way to specify which pointer is for input, wich is for output and which is for both (there are extensions, and naming tricks, but it's still a matter).
fortran
+2  A: 

In the specific example, you should use size_t as the return type: this is the data type that adequately represents how large strings can get on any system. I.e. you can't possibly have a string that is longer than a size_t can represent. Then, you can fairly safely use (size_t)-1 as an error indicator: realistically, you also cannot put a string with that size into memory, since you also need some address space for the code you are executing; it becomes a limitation of your API that such long strings would not be supported if they existed.

Your approach not only has the disadvantage using more memory, but also the disadvantage of being slower: the callee needs to malloc, the caller needs to free. Those are fairly expensive operations.

There is one other standard approach relevant here: errno. In case of an error indicator, you don't know what the error is. So in C, rather than using an out parameter, we typically put the error details into a global or thread-local variable.

Martin v. Löwis
Thanks for your input. I was aware of `size_t` but the code above is just an example I threw together to show what I was talking about, so I didn't consider it carefully. Assuming that you mean the maximum value of `size_t` by `(size_t)`, what if you want to support lengths of strings all the way up to the max? I understand that this isn't often a concern but since the reason I'm using C is to write an interpreter for a higher-level language, it's still a concern.
Imagist
By (size_t)-1, I meant the value -1 casted to size_t; and yes, that gives you the maximum value of size_t. As I said, you *really* don't need to support strings of that size, since they cannot possibly fit into memory. For example, on a 32-bit machine, (size_t)-1 is one byte short of 4GB; along with the terminating 0, the string would require 4GB. Add to that the malloc header, and it just won't fit into address space (let alone that you need some space for the code, and that operating systems often don't give you the full 4GB).
Martin v. Löwis
+1  A: 

The biggest downside is that it requires findChar()'s callers to free() the returned memory, or create a memory leak. You've reinvented the strchr() wheel poorly.

I also don't see why you're thinking that returning a pointer to unsigned int is such a big step forward. First, you could just return an unsigned int, if all you're after is the ability to return values up to 2^32 on a 32-bit machine instead of 2^31-1. Second, your stated goal is to avoid a problem with large strings. Well, what if you're on a 64-bit machine, where 'int' and 'unsigned int' remain 32 bits? What you really want here is a long, but returning pointers doesn't actually help here.

ELIDED BOGUS CRITICISM

Warren Young
Imagist
Sorry, not thinking straight. It's 2am, whaddaya want? :)
Warren Young
THank you for your input!
Imagist
+3  A: 

Without the malloc, the position can be still a stack variable and you can use it in an if statement:

int findChar(char ch, char* in, int* pos)
{
    int i = 0;
    for(i = 0; in[i] != '\0'; i++) 
    {
        if(in[i] == ch) 
        {
            *pos = i;
            return 1;
        }
    }

    return 0;
}
Tom
This is findchar is a much, much better API than "int findChar(char ch, char* in)". While not uncommon to mix value and error indication in the return value, it is bad practice since it is a violation of KISS.
hlovdal
See http://lcsd05.cs.tamu.edu/slides/keynote.pdf and http://video.google.com/videoplay?docid=-3733345136856180693 for how to write good APIs. Read about realloc in 'Code Complete' by Steve McConnell for an example of an horrible API.
hlovdal
+1  A: 

I am not an expert, but I think a ton of small mallocs can cause problems. First, you have to take care of freeing the memory after the use of the value. Then you also have to deal with the fragmentation of the free memory. Passing as pointer is more suitable for complex structures.

bandi
+1  A: 

I'd say the most severe downside to your code is that you use one return value to represent both a general failure and the result if successful.

While this is a common practice, it can lead to wierd scenarios when requirements change, just like the one you described. An alternative practice would be to separate the return values, i.e. something like this

int findChar(char ch, char const * const in, unsigned int * const index)
{
    if ( in != NULL && index != NULL)
    {
        unsigned int i;
        for(i = 0; in[i]; i++)
        {
            if(in[i] == ch)
            {
                *index = i;
                return EXIT_SUCCESS;
            }
        }
    }
    return EXIT_FAILURE;
}

...where the function return value tells you whether the function was successful or not, separately from the value of 'index'.

Then again, as fortran noted, there is no way to enforce whether the pointers are input values, output values, or both (i.e. modified inside the function).

Christoffer