
I have a generic growing buffer intended to accumulate "random" string pieces and then fetch the result. The code that handles this buffer is written in plain C.

Pseudocode API:

    void write(buffer_t *buf, const unsigned char *bytes, size_t len);  /* appends */
    const unsigned char *buffer(buffer_t *buf);  /* returns accumulated data */

I'm thinking about the growth strategy I should pick for that buffer.

I do not know whether my users would prefer to save memory or to gain speed, nor what the nature of their data will be.

I've seen two strategies in the wild: grow the buffer by fixed-size increments (that is what I currently have implemented) or grow the data exponentially. (There is also the strategy of allocating exactly the amount of memory needed, but that is not as interesting in my case.)
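
For concreteness, here is a minimal sketch of the exponential strategy behind the API above. The struct layout and the `buffer_write`/`buffer_data` names are assumptions (renamed from the pseudocode so as not to clash with libc's `write`), and error handling is omitted:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        unsigned char *data;  /* accumulated bytes */
        size_t len;           /* bytes used */
        size_t cap;           /* bytes allocated */
    } buffer_t;               /* assumes zero-initialization before first use */

    /* Appends, growing the capacity by a factor of 1.5 when full. */
    void buffer_write(buffer_t *buf, const unsigned char *bytes, size_t len)
    {
        if (buf->len + len > buf->cap) {
            size_t cap = buf->cap ? buf->cap : 16;
            while (cap < buf->len + len)
                cap += cap / 2;                   /* exponential growth, factor 1.5 */
            buf->data = realloc(buf->data, cap);  /* error handling omitted */
            buf->cap = cap;
        }
        memcpy(buf->data + buf->len, bytes, len);
        buf->len += len;
    }

    /* Returns the accumulated data. */
    const unsigned char *buffer_data(buffer_t *buf)
    {
        return buf->data;
    }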

Perhaps I should let the user pick the strategy... but that would make the code a bit more complex...

Once upon a time, Herb Sutter wrote (referencing Andrew Koenig) that the best strategy is probably exponential growth with a factor of 1.5 (search for "Growth Strategy"). Is this still the best choice?

Any advice? What does your experience say?

+6  A: 

The exponential growth strategy is used throughout STL and it seems to work fine. I'd say stick with that at least until you find a definite case where it won't work.

Blindy
Where is it used besides std::vector?
Roger Pate
`std::string`, I think.
Alexander Gladysh
std::vector (and std::string) are the main examples I had too, but they're pretty much the only contiguous memory classes in STL.
Blindy
Ah, std::string makes sense, it's not in the STL though. (Use stdlib if you mean the C++ standard library.) I was thinking the same thing, all of the other containers being non-contiguous.
Roger Pate
Actually, as I remember, std::string is not required to be contiguous either...
Alexander Gladysh
@Roger: Could you clarify why `std::string` wouldn't be considered a part of the STL?
GMan
AFAIK, string is contiguous, ropes aren't.
Blindy
string *could* be non-contiguous but the implementation of c_str() and data() would be awful.
Zan Lynx
@GMan: The STL is a specific library "of container classes, algorithms, and iterators" (http://www.sgi.com/tech/stl/stl_introduction.html) which never included basic_string or iostreams. It does not mean all the templates in the stdlib, and the name is confusing that way. Also see http://www.hpl.hp.com/techreports/95/HPL-95-11.html
Roger Pate
@Alexander: Under current C++, string's data has to be contiguous (but not the terminating null char). This is why there are separate data and c_str methods: to allow cheap COW substrings. The C++0x draft states this even more unambiguously: "The char-like objects in a basic_string object shall be stored contiguously." [21.3.1/3 in n2723].
Roger Pate
@Roger Pate: C++ 0x requires contiguous data, but the current one really doesn't. No publicly available implementation does so, but I once modified SGI's rope class to (I believe) meet the requirements (but left it using non-contiguous storage).
Jerry Coffin
@Jerry: op[] is required to return references to contiguous elements for pos < size() because it must give the same object as data()[pos] (which must be contiguous) by 21.3.4/1. I agree 0x states this requirement much more clearly, but it's still there for current C++.
Roger Pate
@Jerry: But you don't have to take my word for it.. trust Matt Austern (who I think is the "I" here): "I don't believe it's possible to write a useful and standard-conforming basic_string that isn't contiguous." http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#530
Roger Pate
@Roger: it's not really there for current C++. As stated in the comment you cited: "We **almost** require contiguity already." (emphasis added). What it comes down to is that it does have to be contiguous from the time `data()` is called until something that invalidates the buffer that returned (e.g. any modification to the string). As for "trust Matt Austern", I certainly do to a degree. When you add "useful", it becomes open to argument. I found the implementation I did useful at one time, but haven't used it in years. I do think guaranteed contiguity is more useful though.
Jerry Coffin
@Jerry: You misread that sentence. A magical implementation that determines you never use op[] would not be required, under the letter but not the spirit of the standard, to store contiguously. I didn't add useful, he did, because magical implementations are not useful. Your own ropes may be worthwhile, but that worth shouldn't depend on naming it "std::string".
Roger Pate
@Roger:no magic is needed. When data (or usually c_str()) is called, you coalesce the data into a contiguous buffer. It remains contiguous until or unless modified. At that point, if the user still has a pointer returned from `data()`, it's invalid. As for a requirement that `operator[]` actually call `data()`, it simply doesn't exist -- since a conforming program can't tell whether `data()` was called, there's no requirement to call it (aka, the "as if" rule).
Jerry Coffin
+2  A: 

The answer, as always, is "it depends".

The idea behind exponential growth, i.e. allocating a new buffer that is x times the current size, is that as you require more buffer, the chances are that you'll need much more buffer than a small fixed increment provides.

So, if you have an 8-byte buffer and need more, allocating an extra 8 bytes is OK, but allocating an additional 16 bytes is probably a better idea: someone who has filled a 16-byte buffer is unlikely to need just 1 extra byte. And if they do, all that happens is that you waste a little memory.

I thought the best growth factor was 2, i.e. double your buffer each time, but if Koenig/Sutter say 1.5 is optimal, I agree with them. You may want to tweak your growth rate after getting some usage statistics, though.

So exponential growth is a good trade-off between performance and keeping memory usage low.

gbjbaanb
It really depends on the nature of the application. Doubling the buffer each time makes it much easier for the memory management system to keep the address space unfragmented, for example. For a long-running application, keeping the space unfragmented (so you avoid wasting memory in too-small-to-be-useful chunks) can be more important than the extra memory you save by using a 1.5 factor instead of 2.
Anon.
+3  A: 

The key point is that the exponential growth strategy lets you avoid expensive copies of the buffer content when you hit the current size, at the cost of some wasted memory. The article you link to has the numbers for the trade-off.

Nikolai N Fetissov
+1  A: 

There's no way anyone can give good advice without knowing something about the allocations, runtime environment, execution characteristics, etc., etc.

Code which works is way more important than highly optimized code... which is still under development. Choose some algorithm, any workable algorithm, and try it! If it proves suboptimal, then change the strategy. Placing this in the control of the library user often does them no favors. But if you already have some option scheme in place, then adding it could be useful, unless you hit on a good algorithm (and exponential growth with a factor around 1.5 is a pretty good one).


Also, the use of a function named `write` in C (not C++) conflicts with `<io.h>` and `<stdio.h>`. It's fine if nothing uses them, but it would also be hard to add them later. Best to use a more descriptive name.

wallyk
`<io.h>` is very MSVC-specific...in Unix, `write` is declared in `<unistd.h>`. :-P But yes, whether or not you include `<unistd.h>` or the like, the fact is that `write` is defined in libc, and making one's own implementation of `write`, especially one that's incompatible with the system one, is not wise.
Chris Jester-Young
That is why I said that this is pseudocode. :-) The actual function is called `lbsSB_write()`. Thanks for the warning anyway.
Alexander Gladysh
@Chris: `<io.h>` is the old standard header file for Unix which contains `read()`, `write()`, `seek()`, etc. Many embedded developer runtime libraries continue this tradition.
wallyk
@wallyk: I'm having difficulty finding this header file in any old version of Unix that I'm looking at (Google isn't having any luck either when searching for `"io.h" Unix`): the versions I'm looking at are in http://minnie.tuhs.org/Archive/PDP-11/Trees/ from the Unix Heritage Society. Note that in neither V6 nor V7 is there an `io.h`; system calls were made directly. The 2.11BSD tree features `unistd.h`.
Chris Jester-Young
See http://stackoverflow.com/questions/2211030/failing-to-compile-a-project-missing-io-h-file
wallyk
+1  A: 
  • Double the size until a threshold (~100 MB?) and then lower the exponential growth factor toward 1.5, ..., 1.3 (see the sketch after this list)
  • Another option would be to make the default buffer size configurable at runtime.
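
A minimal sketch of the first option (the threshold value and names are invented for illustration):

    #include <stddef.h>

    #define GROWTH_THRESHOLD (100u * 1024u * 1024u)  /* ~100 MB, tunable */

    /* Double small buffers; above the threshold, back off to factor 1.5.
       Overflow checks omitted. */
    static size_t next_capacity(size_t cap)
    {
        return cap < GROWTH_THRESHOLD ? cap * 2 : cap + cap / 2;
    }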
fabrizioM
That is way too complicated for something which has not yet proven to be a problem—let alone an issue. Such complexity will be rather difficult to detect as a cause of errors.
wallyk
+6  A: 

Unless you have a good reason to do otherwise, exponential growth is probably the best choice. Using 1.5 for the factor isn't really magical, and in fact that's not what Andrew Koenig originally said. What he originally said was that the growth factor should be less than (1+sqrt(5))/2, the golden ratio (~1.6).
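
For intuition on that limit (an illustration, not part of the original answer): with a factor of 2, the sum of all previously freed blocks is always smaller than the next request, so a simple allocator can never reuse the coalesced hole; with 1.5 it can after a few steps. A tiny program makes this visible:

    #include <stdio.h>

    /* Compare the next requested block size against the total size of all
       previously freed blocks. The new block can fit into the coalesced
       hole only when freed >= next. */
    int main(void)
    {
        double factors[] = { 2.0, 1.5 };
        for (int f = 0; f < 2; ++f) {
            double size = 16.0, freed = 0.0;
            printf("factor %.1f:\n", factors[f]);
            for (int step = 1; step <= 8; ++step) {
                freed += size;       /* the old block is freed...        */
                size *= factors[f];  /* ...and a bigger one is requested */
                printf("  step %d: next=%8.0f freed=%8.0f %s\n",
                       step, size, freed, freed >= size ? "reusable" : "no fit");
            }
        }
        return 0;
    }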

Pete Becker says that when he was at Dinkumware, P.J. Plauger, the owner of Dinkumware, said they had done some testing and found that 1.5 worked well. When you allocate a block of memory, the allocator will usually allocate a block that's at least slightly larger than you requested, to leave room for a little book-keeping information. My guess (though unconfirmed by any testing) is that reducing the factor a little lets the real block size still fit within the limit.

References: I believe Andrew originally published this in a magazine (the Journal of Object Oriented Programming, IIRC) which hasn't been published in years now, so getting a reprint would probably be quite difficult.

Andrew Koenig's Usenet post, and P.J. Plauger's Usenet post.

Jerry Coffin
Cool details, thanks! Do you, by any chance, have any references? I'd love to read them.
Alexander Gladysh
Thank you for the references!
Alexander Gladysh
+1  A: 

I usually use a combination of adding a small fixed amount and multiplying by 1.5, because it is efficient to implement and leads to reasonable step widths: bigger at first, more memory-conscious as the buffer grows. As the fixed offset I usually use the initial size of the buffer, and I start with rather small initial sizes:

new_size = old_size + ( old_size >> 1 ) + initial_size;

As `initial_size` I use 4 for collection types; 8, 12, or 16 for string types; and 128 to 4096 for input/output buffers, depending on the context.
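
As an illustration (not from the answer itself), this little program prints the resulting growth sequence for an initial size of 100 and confirms the step counts quoted below:

    #include <stdio.h>

    int main(void)
    {
        size_t initial_size = 100, size = initial_size;
        int steps = 0;
        while (size < 3000) {
            size = size + (size >> 1) + initial_size;
            printf("step %d: %zu\n", ++steps, size);
        }
        /* prints 250, 475, 812, 1318, 2077, 3215: six steps to pass 3000 */
        return 0;
    }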

Here is a little chart that shows that this grows much faster (yellow + red) in the early steps compared to multiplying by 1.5 alone (red).

[growth chart]

So, if you started with 100, you would need, for example, 6 increases to accommodate 3000 elements, while multiplying by 1.5 alone would need 9.

At larger sizes the influence of the addition becomes negligible, so both approaches then scale equally well, with a factor of 1.5. These are the effective growth factors if you use the initial size as the fixed amount for the addition:

2.5, 1.9, 1.7, 1.62, 1.57, 1.54, 1.53, 1.52, 1.51, 1.5, ...
x4u
+1  A: 

The point of using exponential growth (whether the factor is 1.5 or 2) is to avoid copies. Each time you realloc the array, you can trigger an implicit copy of its contents, which of course gets more expensive the larger the array grows. By using exponential growth, you get an amortized constant number of copies per element; i.e. you rarely end up copying.
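
To make the amortized cost concrete, here is a small simulation (an illustration; the 16-byte start and step are arbitrary) that counts the bytes moved when every resize copies the whole buffer:

    #include <stdio.h>

    /* Append `total` bytes one at a time; on each resize, charge the cost
       of copying the old contents. */
    static unsigned long long bytes_moved(unsigned long long total, int exponential)
    {
        unsigned long long cap = 16, used = 0, moved = 0;
        while (used < total) {
            if (used == cap) {
                moved += used;                     /* old contents are copied  */
                cap = exponential ? cap + cap / 2  /* factor 1.5               */
                                  : cap + 16;      /* fixed 16-byte increments */
            }
            ++used;
        }
        return moved;
    }

    int main(void)
    {
        unsigned long long n = 1000000;
        printf("fixed increment: %llu bytes copied\n", bytes_moved(n, 0));
        printf("factor 1.5:      %llu bytes copied\n", bytes_moved(n, 1));
        return 0;   /* fixed-step copying grows quadratically, 1.5 linearly */
    }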

As long as you're running on a desktop computer of some kind, you can expect an essentially unlimited amount of memory, so time is probably the right side of that tradeoff. For hard real-time systems, you would probably want to find a way to avoid the copies altogether -- a linked list comes to mind.

Joel
+1  A: 

As a wild idea, for this specific case, you could change the API to require the caller to allocate the memory for each chunk, and then remember the chunks instead of copying the data.

Then, when it's time to actually produce the result, you know exactly how much memory is going to be needed and can allocate exactly that.

This has the benefit that the caller will need to allocate memory for the chunks anyway, and so you might as well make use of that. This also avoids copying data more than once.

It has the disadvantage that the caller has to dynamically allocate each chunk. To get around that, you could allocate the memory for each chunk yourself and remember those chunks, rather than keeping one large buffer that gets resized when it fills up. This way, you'll copy the data twice (once into the chunk you allocate, and a second time into the resulting string), but no more; resizing a single buffer several times can end up making more than two copies.

Further, really large areas of free memory may be difficult for the memory allocator to find; smaller chunks may well be easier to allocate. There might not be space for a one-gigabyte chunk of memory, but there might be space for a thousand one-megabyte chunks.
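
A minimal sketch of this remember-the-chunks approach (types and names invented for illustration; error handling omitted). The caller keeps each chunk alive until the result is built, so the data is copied only once, into the exactly-sized result:

    #include <stdlib.h>
    #include <string.h>

    typedef struct chunk {
        const unsigned char *bytes;
        size_t len;
        struct chunk *next;
    } chunk_t;

    typedef struct {
        chunk_t *head, *tail;
        size_t total;  /* running sum of chunk lengths */
    } chunk_buffer_t;

    /* Remember a caller-allocated chunk; no data is copied here. */
    void chunk_append(chunk_buffer_t *cb, const unsigned char *bytes, size_t len)
    {
        chunk_t *c = malloc(sizeof *c);
        c->bytes = bytes;
        c->len = len;
        c->next = NULL;
        if (cb->tail) cb->tail->next = c; else cb->head = c;
        cb->tail = c;
        cb->total += len;
    }

    /* Coalesce all chunks into a buffer of exactly the needed size. */
    unsigned char *chunk_result(const chunk_buffer_t *cb)
    {
        unsigned char *out = malloc(cb->total);
        size_t off = 0;
        for (const chunk_t *c = cb->head; c; c = c->next) {
            memcpy(out + off, c->bytes, c->len);
            off += c->len;
        }
        return out;
    }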

Lars Wirzenius