views:

246

answers:

8

Hi all,

I've written a simple library file with a function for reading lines from a file of any size. The function is called by passing in a stack-allocated buffer and size, but if the line is too big, a special heap-allocated buffer is initialized and used to pass back a larger line.

This heap-allocated buffer is function-scoped and declared static, initialized to NULL at the beginning of course. I've written in some checks at the beginning of the function, to check if the heap buffer is non-null; if this is the case, then the previous line read was too long. Naturally, I free the heap buffer and set it back to NULL, thinking that the next read will likely only need to fill the stack-allocated buffer (it should be very rare to see lines over 1MB long, even in our application!).

I've gone over the code and tested it fairly thoroughly, both by reading it carefully and by running a few tests. I am reasonably confident that the following invariant is maintained:

  • The heap buffer will be null (and will not leak any memory) on function return if the stack buffer is all that is needed.
  • If the heap buffer is not null, because it was needed, it will be freed on the next function call (and possibly reused if needed on that next line).

But I've thought of a potential problem: If the last line in a file is too long, then since the function is presumably not called again, I'm not sure I have any way to free the heap buffer-- it is function-scoped, after all.

So my question is, how do I go about freeing dynamically allocated memory in a function-scoped static pointer, ideally without calling the function again? (And ideally without making it a global variable, either!)

Code available on request. (I just haven't got access now, sorry. And I'm hoping the question is sufficiently general and well-explained for it not to be needed, but by all means feel free to disabuse me of that notion!)


EDIT: I feel I should add a couple of notes about the usage of the function.

This particular function is used in the form of lines being read serially from a file, and then immediately copied into POD structs, one line per struct. Those are created on the heap as the file is read, and each one of those structs has a char pointer containing (a cleaned up version of) a line from the file. In order for these to persist, a copy already has to occur. (That was one of the big counterarguments brought up in many of the answers-- oh no, the line needs to be COPIED, oh dearie me).

As for multithreading, as I said this is designed to be used serially. No, it isn't thread safe, but I don't care.

Thanks for the multitude of responses, though! I'll read them more thoroughly when I get time. Currently, I'm leaning towards either passing an extra pointer around or redesigning the function so that when fgets shows EOF, then I might just build the freeing logic there instead and the user hopefully won't need to worry about it.

+1  A: 

Aside from the difficulty of freeing that dynamically allocated buffer, there is another potential problem. It is not thread safe. Since it is a library function, then there is always the possibility that it will be used in a multi-threaded environment in the future.

It would probably be better to require the calling function to free the buffer via a related library function.

Mark Wilkins
The calling function is itself a "read-line" sort of function that is called elsewhere. I don't feel comfortable going further up in the calling tree to insert a library call there-- I want this line-reading operation to be as transparent to higher levels of abstraction as possible... But you're onto something, for sure. I'll have to ponder this one carefully.
Platinum Azure
I do see that it would be desirable to make it as simple as possible for the callers. It seems, though, that avoiding some kind of global memory that magically exists somewhere is best - otherwise it usually leads to some difficult problems to debug. Maybe a "readline" structure could encapsulate it (assuming straight C). It could have the stack variable and a dynamic pointer. The readline function could use that structure but it would live in the caller's scope. It would need a cleanup call before leaving the function that uses "readline" but it would simplify the usage.
Mark Wilkins
A: 

There's a few hacks I can think of, although both require moving the static declaration out of the function. I can't imagine why that would be a problem.

Using a GCC extension,

static char *buffer;
void use_buffer(size_t n) {
    buffer = realloc(buffer, n);
}
void cleanup_buffer() __attribute__((destructor)) {
    free(buffer);
}

Using C++,

static char *buffer;
static class buffer_guard {
    ~buffer_guard() { free(buffer); }
} my_buffer_guard;

In any case, I don't really like the design. In C, usually the caller is responsible for allocating/freeing memory that it needs to use, even if it's filled in by a callee.

BTW, compare with Glibc's nonstandard getline. It never uses static memory.

ephemient
I believe this would make the buffer global and he specifically requested that it not be global.
Brian T Hannan
It's statically scoped -- it is only visible within the single translation unit (source file), and not global to the whole program.
ephemient
Neither compiler-specific hacks nor C++ are usable here, unfortunately.
Platinum Azure
+1  A: 

Instead of function scope, give it module scope (i.e. at file scope, but static, so it's not visible outside that file. Add a small function that frees the buffer, and use atexit() to assure that's called before the program exits. Alternative, don't worry about it -- a leak that happens only once, and is freed automatically as the program exits isn't particularly harmful.

I feel obliged to say that the design sounds to me like a recipe for disaster though. When you free the buffer, there's virtually no way to even guess whether it might still be in use. The user (apparently) has to keep track of where the data was returned, and copy the data to a new buffer if (and only if) you allocated one dynamically. In a multi-threading environment, you need to make the internal pointer thread-local to have any chance of working correctly at all. To the user, the function might do one of two entirely different things -- either return a buffer that's owned by the user, OR return a buffer that's owned by the function, and can only be used safely by allocating another buffer, and copying the data into the other buffer before the function is called again.

Jerry Coffin
+1 for highlighting the ownership issue. I think you mean `atexit()` not `onexit()` though.
j_random_hacker
Oops -- thanks for pointing out my mistake. I've fixed it now.
Jerry Coffin
+1  A: 

That could still be okay if you use the standard technique to indicate end-of-file (i.e have you read-line function return NULL).

What happens in this case is that after the final line is read, one more call to your read-line function will be needed so that it can return NULL to indicate that the end of file has been reached. In this last call, you can then free you buffer.

R Samuel Klatchko
+1  A: 

Two choices that occur immediately:

  1. Make the pointer to the heap-allocated buffer static but file scoped. Add a (static) function that checks if it is not null and if it is not null free()s it. Call atexit(free_func) at the start of the program, where free_func is the static function. You can have some global setup routine (caled by main()) where this is done.

  2. Don't worry about it; heap-allocated memory is released by the OS when your process exits, and the memory leak is not cumulative, so even if your program has a long life it won't raise an OOM exception (unless you have some other bug).

I assume your app is NOT multithreaded; in this case, you should not use a static buffer at all, or you should use thread-local data.

Tim Schaeffer
+3  A: 

If you can change the function, I would recommend changing the function interface itself. I know you have spent a lot of time debugging and testing it, but there are a few problems with your current implementation:

  • it is not thread-safe,
  • the user has no control over the data, so he must copy it if he needs it later, most likely in a buffer that's going to be malloc()ed, thus nullifying any advantage you got by the selective use of malloc() in your function,
  • most importantly, as you have discovered, a special action has to be taken by the user for a long last line.

Your users should not be worried by the implementation oddity of your function, they should be able to "just use it".

Unless you are doing it for educational purposes, I would recommend looking at this page, which has one implementation of "reading an arbitrarily long line from a stream", and links to other such implementations (each implementation is slightly different from the others, so you should be able to find one that you like).

Based upon your edit, MT-safe is not a requirement, and a copy is always going to happen. So, the most obvious design is one of the two:

  • Let the user supply a char **, which points to a buffer that your function will allocate, using a combination of malloc() and realloc() (if needed). It is the user's responsibility to free() it when done. That way, the user doesn't have to copy the data again, since he can pass a pointer to wherever the final destination of the data is.
  • return a char * that is allocated by your function. Again, it's the user's responsibility to free() it.

Both are pretty much equivalent.

For your current implementation, you can always return "not end of file" if the last line is very long, and doesn't end in a newline. Then, the user is going to call your function again, and then you can free your buffer. Personally, I would be happier with a function that allows me to read as many lines as I want, and not force me to go to the end of file.

Alok
j_random_hacker
@j_random_hacker: thanks. I feel that my post could be edited a bit for coherency, but that's mostly because of the edit by the OP. I will try to fix it if I get some time.
Alok
+1 for the xref URL
Jonathan Leffler
A: 

I was just going to comment below Mark's answer, but it may feel a little bit cramped. Still, this answer is in essence a comment on his answer, which I find very good in addition to being quick :).

Not only is your function not MT-safe, but even without threads, the interface to use it correctly is complicated. The caller must have finished with the previous result before calling the function again. If this code is still in use two years from now, someone will scratch his head trying to use it right... or worse, use it wrong without even thinking about it. That person could even be you...

Mark's suggestion (requiring the caller to free the buffer) is IMHO the most reasonable. But perhaps you don't trust malloc and free not to cause fragmentation in the long run, or have some other reason to prefer the static buffer solution. In this case you can keep the static buffer for ordinary-length lines, define a boolean flag that indicates if the static buffer is currently busy, and document that the following function (and not free) should be called with the address of the buffer when the caller no longer uses it:

char static_buffer[512];
int buffer_busy;

void free_buffer(char *p)
{
  if (p == static_buffer)
  {
     assert(buffer_busy);
     buffer_busy=0;
  }
  else free(p);
}

char *get_line(...)
{
  char *result;
  if (..short line..)
  {
     result = static_buffer;
     assert(!buffer_busy);
     buffer_busy=1;
  }
  else result = malloc(...);
  ...
  return result;
}

The only circumstances in which the assertions will trigger are circumstances in which your previous implementation would have silently gone wrong, and the overhead is very low compared to your existing solution (only toggling the flag, and asking the caller to call free_buffer when he's finished, which is cleaner). If the assertion in get_line in particular triggers, it means you needed dynamic allocation after all, because the caller could not be finished with a buffer at the time he was asking for another.

Note: this is still not MT-safe.

Pascal Cuoq
+1  A: 

The interface you have chosen makes this an unsolvable problem:

  • The client must not know if the return value points to static or dynamic memory.

  • The return value must point to memory that outlives the call.

  • Any call might be the last.

I'm not sure why you are troubled by this leak. After all, if the client reads a very long line, does something with the line, then does a ton of computation and allocation before reading the next line, you still have a big hunk of memory sitting around unused, clogging up the system. If this OK with you (arbitrary computation takes place before memory is reclaimed), you could just fess up that you're willing to retain dead memory indefinitely.

If you can't live with the leak, the simplest thing to do is to widen the interface so that the client can notify your function when the client is done with the memory. (Right now the contract with the client says that the client owns the memory until it calls your function again, at which point ownership reverts to your function.) Of course, to change the interface means either

  • adding a new function, which would require you to promote your pointer to be static but local to the compilation unit, or

  • adding some argument to the existing function (or overloading an argument) so that you have a call which means "I am done with your memory now, but I don't want another line".

A more radical change would be to rewrite the function to use dynamically allocated memory throughout its lifetime, gradually enlarging the block as needed until it is as large as the largest block ever read (or perhaps rounded up to the next power of two). Depending on actual cases this strategy may consume less address space than keeping a big static buffer.

In any case I'm not convinced you should be worrying about this corner case. If you think this case matters, please edit your question to show us the evidence.

Norman Ramsey
+1 for pointing out that, even forgetting about the last-line case, your current design leaves the client with no way to reclaim the potentially huge chunk of memory allocated by your function -- all it can do is call it again and hope for a shorter line this time.
j_random_hacker