views:

1291

answers:

12

I've always thought it's the general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went and tested it, and it seems that's not so:

Here are some test results:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

That's about 3 to 4 times slower! That doesn't really square with the "vector may be slower for a few nanosecs" comments.

And the code I used:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }

    ~TestTimer()
    {
        using namespace std;
        using namespace boost;

        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;

        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }

private:
    std::string name;
    boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

Am I doing it wrong or something? Or have I just busted this performance myth?

Edit: I'm using Release mode in MSVS2005.


In MSVC, #define _SECURE_SCL 0 reduces UseVector by half (bringing it down to 4 seconds). This is really huge IMO. Someone please upvote me a couple more times so this gets known by more folks :p

+1  A: 

EDIT: Ok, well this has been a nasty lesson to me. kizzx2 - my apologies for a misguided answer. I - browbeaten - did run the code, and it became obvious - as expected - that the allocation rather than the loop was the issue. In a comment below I mistakenly attributed that to small-object optimisation. Rather than delete this answer, I'll add some insight which I don't believe has been covered explicitly by other answers (though they're evolving all the time, and they do show how to bypass the step in which this unnecessary processing is done):

GNU's STL (and others), given vector<T>(n), default constructs a prototypal object T() - the compiler will optimise away the empty constructor - but then a copy of whatever garbage happened to be in the memory addresses now reserved for the object is taken by the STL's __uninitialized_fill_n_aux, which loops populating copies of that object as the default values in the vector. So, "my" STL is not looping constructing, but constructing then loop/copying. It's counter-intuitive, but I should have remembered, as I commented on a recent Stack Overflow question about this very point: the construct/copy can be more efficient for reference-counted objects etc.

So:

vector<T> x(n);

or

vector<T> x;
x.resize(n);

is - on many STL implementations - something like:

T temp;
for (int i = 0; i < n; ++i)
    x[i] = temp;

The issue is that the current generation of compiler optimisers don't seem to work from the insight that temp is uninitialised garbage, and fail to optimise out the loop and default copy constructor invocations. You could credibly argue that compilers absolutely shouldn't optimise this away, as a programmer writing the above has a reasonable expectation that all the objects will be identical after the loop, even if garbage (usual caveats about 'identical'/operator== vs memcmp/operator= etc. apply). The compiler can't be expected to have any extra insight into the larger context of std::vector<> or the later usage of the data that would suggest this optimisation is safe.

This can be contrasted with the more obvious, direct implementation:

for (int i = 0; i < n; ++i)
    x[i] = T();

Which we can expect a compiler to optimise out.

To be a bit more explicit about the justification for this aspect of vector's behaviour, consider:

std::vector<big_reference_counted_object> x(10000);

Clearly it's a major difference if we make 10000 independent objects versus 10000 referencing the same data. There's a reasonable argument that the advantage of protecting casual C++ users from accidentally doing something so expensive outweighs the very small real-world cost of hard-to-optimise copy construction.

ORIGINAL ANSWER (for reference / making sense of the comments): No chance. vector is as fast as an array, at least if you reserve space sensibly. I'm not even going to bother looking at your code... waste of time. Check you've got optimisation on and your compiler's not doing some extra non-Standard checks for you (I think Microsoft likes to "help" people like that sometimes). Find another myth to bust.

Tony
I really cannot see how this answer is even slightly useful to anyone. I wish I could downvote twice.
kizzx2
-1, there goes my support for kizzx2. vector is never going to be as fast as an array, due to the additional features it provides; rule of the universe: everything has a price!
YeenFei
You're missing out, Tony… it is an example of an artificial benchmark, but it does prove what it purports to.
Potatoswatter
Roses are green, violets are orange, the downvotes are bitter, but the answer begs them.
Pavel Minaev
It's obvious that the reason your test harness is slower is that your compiler/library allocator is slower than malloc. Nothing specific to vector at all, and something you can control. C++ allocators are designed for higher performance during frequent allocation and deallocation of smallish objects. They'll do some extra checks before falling back on malloc() to get another chunk of heap. And if my answer seems dismissive, it's provoked by the "blah blah blah" attitude in the question. @kizzx2 It's true you hadn't fallen for one of the more obvious mistakes I listed - credit to you for that.
Tony
+4  A: 

Try disabling checked iterators and building in release mode. You shouldn't see much of a performance difference.
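For reference, the macros mentioned in this thread must be defined before the first standard-library header is included, roughly like this (MSVC-specific; a fragment, not a complete program):

```cpp
// MSVC-specific: disable checked iterators and iterator debugging.
// These must appear before any standard-library #include, or they
// silently have no effect on the already-included headers.
#define _SECURE_SCL 0
#define _HAS_ITERATOR_DEBUGGING 0

#include <vector>
```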

kloffy
Tried `#define _SECURE_SCL 0`. That made `UseVector` somewhere around 4 seconds (similar to `gcc` below) but still it's twice as slow.
kizzx2
This is almost certainly the cause. Microsoft graciously gives us iterator debugging on by default for both debug and release. We found this to be the root cause of a massive slowdown after upgrading from 2003 to 2008. Definitely one of the most pernicious gotchas of Visual Studio.
Doug T.
@kizzx2 there's another macro to disable: HAS_ITERATOR_DEBUGGING or some such.
Doug T.
As @Martin and my answers show, gcc shows the same pattern, even with optimization at `-O3`.
John Kugelman
@Doug: Looking at the doc, I think `_HAS_ITERATOR_DEBUGGING` is disabled in release build: http://msdn.microsoft.com/en-us/library/aa985939(VS.80).aspx
kizzx2
+1  A: 

Here's how the push_back method in vector works:

  1. The vector allocates X amount of space when it is initialized.
  2. As stated below, it checks whether there is room in the current underlying array for the item.
  3. It makes a copy of the item in the push_back call.

After calling push_back X items:

  1. The vector reallocates kX amount of space into a 2nd array.
  2. It copies the entries of the first array into the second.
  3. Discards the first array.
  4. Now uses the second array as storage until it reaches kX entries.

Repeat. If you're not reserving space, it's definitely going to be slower. More than that, if it's expensive to copy the item, then `push_back` like that is going to eat you alive.

As to the vector versus array thing, I'm going to have to agree with the other people. Run in release, turn optimizations on, and put in a few more flags so that the friendly people at Microsoft don't #@%$^ it up for ya.

One more thing, if you don't need to resize, use Boost.Array.

wheaties
I understand people don't like to read a bunch of code when it's posted verbatim. But I did use `reserve` like I should.
kizzx2
Sorry I missed it. Was nothing else I put up there helpful at all?
wheaties
`push_back` has amortized constant time. It sounds like you're describing an O(N) process. (Steps 1 and 3 seem completely out of place.) What makes `push_back` slow for OP is the range check to see whether reallocation needs to happen, updating the pointers, the check against NULL inside placement `new`, and other little things that normally get drowned out by the program's actual work.
Potatoswatter
It's going to be slower even with `reserve` as it still has to make that check (whether it needs to realloc) on every `push_back`.
Pavel Minaev
All good points. What I'm describing sounds like an O(N) process but it's not, well not quite. Most people I know do not understand how a `vector` performs its resize functionality; it's just "magic." Here, let me clarify it a bit more.
wheaties
+49  A: 

Using the following:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.196 seconds
UseVector completed in 4.412 seconds
UseVectorPushBack completed in 8.017 seconds
The whole thing completed in 14.626 seconds

So array is twice as quick as vector. But this is expected, as you run across the vector twice and the array only once. Remember that when you resize() the vector you are not only allocating the memory but also running through the vector and calling the constructor on each member.

If we modify the array version to run over the array twice (first to default-initialize, then to do the special initialization), we would expect the times to be more similar.

So modify the Array loop:

    // STUFF as before.
    Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
    // Initialization to make it do the same work as new.
    for(int i = 0 ; i < dimension * dimension; ++i)
    {   pixels[i]   = Pixel();
    }

    for(int i = 0 ; i < dimension * dimension; ++i)
    {
    // STUFF as before

The time now looks like this:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 3.772 seconds
UseVector completed in 4.105 seconds
UseVectorPushBack completed in 8.035 seconds
The whole thing completed in 15.914 seconds

You are doing 2 000 000 000 array accesses ((Array 1000 * 1000) * loop 1000 * 2 (construction + init loops)). The time difference is 0.333 seconds, or a difference of 0.0000000001665 secs per array access. Assuming a 2.33 GHz processor like mine, that's 0.4 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.

Arguing that you need to write an allocator to get comparable speed is bunkum. Just use the vector's constructor where you pass the initialization value. Also, rather than creating the vector and then resizing it, do it in one step. This will save a lot of hassle (only allocating space once) with the dynamic memory management library (which you are invoking a thousand times in the original).

So the array version of the code is:

 Pixel* pixels = (Pixel*)malloc(sizeof(Pixel) * dimensions * dimensions);
 for(int loop = 0; loop < dimensions * dimensions; ++loop)
 {
      pixels[loop] = Pixel(255,0,0);
 }

The vector version of the same code is:

 std::vector<Pixel>  pixels(dimensions * dimensions, Pixel(255,0,0));

Note: Now we are only doing 1,000,000,000 array accesses.

Now doing the same timing again:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.261 seconds
UseVector completed in 2.236 seconds

A slightly better performance for the vector (but insignificant, and it could be caused by any number of factors given how coarse this timing is).

Replacing the malloc with new (and thus not requiring the loop for initialization) improves the array version a bit more. But note that using new here only allows us to use the default constructor, so we cannot initialize the array with a specific value (it is slightly faster but less flexible).

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.044 seconds
UseVector completed in 2.236 seconds

Martin York
I really appreciate the fact that you actually ran the code :)
kizzx2
I was aware of the fact that I waste initializations with `resize()` -- that's why I added the `push_back` version (which doesn't do "wasteful" initializations but turned out to be slower). I'm not aware of how you can simply get a vector of uninitialized bunch of memory like `malloc`
kizzx2
@kizzx2: You need to use `reserve()` instead of `resize()`. This allocates space for the objects (that is, it changes the _capacity_ of the vector) but does not create the objects (that is, the _size_ of the vector is left unchanged).
James McNellis
Supplying your own allocator as a template argument may also help in getting vector to just grab its memory in big uninitialized heap chunks.
Crashworks
@James: That's what I tried to do in `UseVectorPushBack`. Turned out to be even slower.
kizzx2
@Crashworks: I tried to look at it. Seems a total overkill for a simple scenario. Is there a built in `null_allocator` or something?
kizzx2
So I guess `vector` would be "almost" as fast as arrays for long living objects. For short living objects, use arrays to eliminate the cost of initialization (unless you wanna write your own STL allocator)
kizzx2
Honestly, kizzx2, in the realtime graphics world we just don't use the STL much in performance sensitive code. I've never been able to get the code gen to be identical to a simple array (so that advancing and dereferencing an iterator works out to exactly one add op and one load op).
Crashworks
You are doing 1 000 000 000 array accesses. The time difference is 0.333 seconds, or a difference of 0.000000000333 secs per array access. Assuming a 2.33 GHz processor like mine, that's 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.
Martin York
Giving g++ the `-S` command line option will let you see exactly the assembly it is emitting.
Crashworks
Calling `Pixel()` in a loop is not entirely fair. Change the malloc + loop into `new Pixel[dimension * dimension]` and now the array loop is back to its previous fast speed.
John Kugelman
`Pixel::Pixel() : r(255),g(0),b(0) {}` This is just wrong and synthesized to get `vector` a more favorable test result. It wastes the valuable default constructor, and now if I want all green pixels in another loop I'm going to have to do some kung-fu trick. If I used `malloc` it's just another initializer loop.
kizzx2
The final result looks more in line with popular opinions. But I wouldn't say "done correctly vector is faster," though. By 0.025 seconds :P And that's at the price of wasting the default constructor of a contained object *for the sake of the container*, or writing an STL allocator.
kizzx2
@Martin "You are doing 1 000 000 000 array accesses. The time difference is 0.333 seconds. Or a difference of 0.000000000333 per array access. Assuming a 2.33 GHz Processor like mine that's 0.7 instruction pipeline stages per array accesses. So the vector looks like it is using one extra instruction per accesses." You sound like you want to exaggerate the numbers, but I guess looping over a bunch of 1000 * 1000 pixmap is a very common operation in any graphic program.
kizzx2
It is, but we never do it in a linear componentwise way like this -- it's almost always done in SIMD, either as a shader program or as an unrolled loop of intrinsics on the CPU, processing multiple pixels at a time. At this level of performance, even the pipeline bubble introduced by the loop branch is significant. (In fact I'm at work trying to fix some branch mispredicts in our renderer right now.)
Crashworks
@Kizzx2: I revised my opinion on the phrase "Done correctly". But I just hacked together constructor for demonstration purposes. I don;t think writing a default constructor that initializes the array correctly would be that hard.
Martin York
@Martin: The point isn't "initialize the array correctly," it's "initialize the array for a specific purpose," which is not good. In our example, `Pixel()`'s default constructor is locked down to produce red, which is not good. Without typing it all out, imagine if our tests consisted of 3 loops: one making a bunch of red, one making a bunch of green, one making a bunch of blue. The malloc version would scale, but the default-constructor version would only be faster in the red version.
kizzx2
@kizzx2: There is an easy way around that. Give me 20 minutes to look it up. Then we can have a clean Pixel constructor and still get our cool initialization.
Martin York
@kizzx2: Fixed it now. The default constructor is back to normal. Of course now you can use new for the array version so it is back to being slower.
Martin York
@James McNellis: You can't just replace `resize()` with `reserve()`, because this does not adjust the vector's internal idea of its own size, so subsequent writes to its elements are technically "writing past the end" and will produce UB. Although in practice every STL implementation will "behave itself" in that regard, how do you resync the vector's size? If you try calling `resize()` *after* populating the vector, it will quite possibly overwrite all those elements with default-constructed values!
j_random_hacker
-1 from me, because to me this answer attempts to hide a genuine weakness of C++: initialisation is unwieldy and oftentimes unnecessarily inefficient. I like C++ and I'm glad to see these issues being addressed in C++1x, but we should fess up to genuine criticisms of the status quo.
j_random_hacker
@kizzx2: It's 1,000,000,000 array accesses, as you are doing an array of 1000*1000 elements AND then you have another loop (inside the function) that repeats the operation 1000 times.
Martin York
@j_random_hacker: Isn't that exactly what I said? I thought I was very clear that `reserve` only changes the capacity of a vector, not its size.
James McNellis
@j_random_hacker: So you give me -1 rather than the language. OK, I am not going to argue about push_back(), which is obviously much less efficient. But normal initialization seems to be just as efficient for an array as it is for a vector (which shows how good the compilers are).
Martin York
@James: My point is that using `reserve()` instead of `resize()` as you suggest is useless, *because* the former does not change the size -- that's a crucial piece of information about the vector! Choosing to go this route implies that the program can't trust the value of `pixels.size()` afterwards.
j_random_hacker
@Martin: Well I would -1 the language on this if I could... :) The root of the problem as I see it is that, for the desired program behaviour, it *should not be necessary* to construct/assign elements in the vector more than once. You achieve parity with the non-STL version by making it worse (i.e. do unnecessary work), not making the STL version better.
j_random_hacker
@Martin: That is, except for your suggestion to change the default ctor for `Pixel`, which is on the right track as it does make the STL version faster, but it has a negative global impact. (I would instead suggest using the 2-arg form of the vector ctor.) I've written an answer that does only construct/assign each element once by using `back_inserter()`. It's much uglier than the non-STL code though.
j_random_hacker
@j_random_hacker: I don't think you've kept up with Martin's edits. He's already done just what you suggested.
Fred Larson
@Fred: You're right, edited 3 times since I looked at it! Martin now suggests using the 2-arg vector ctor which is good. -1 reverted.
j_random_hacker
What's really interesting is that VC++ ended up being slower with `assign()` than it is with `reserve()`+`push_back()`. Now that's just WTF.
Pavel Minaev
... and ditto for using the assign-like constructor. Another WTF. Poking in disassembly now, but I think this is already bug-filing-worthy. I'll forward this to Stephan tomorrow.
Pavel Minaev
Okay, go figure. There was a lot of exception-related cruft in vector methods. Adding `/EHsc` to compilation switches cleaned that up, and `assign()` actually beats array now. Yay.
Pavel Minaev
@Pavel: Interesting, I hadn't considered the effects of exception-handlnig overheads. I do all my VC++ compilation with `/EHsc`.
j_random_hacker
Also note that the cost of exceptions is practically undetectable when exceptions are not actually thrown, and when they are thrown, efficiency is no longer your major concern.
Martin York
Simply compiling with exceptions enabled induces a measurable performance overhead compared to compiling with them removed in MSVC, regardless of whether you throw them or not. Try it.
Crashworks
@Crashworks: It is very true for 32-bit Windows, but much less for 64-bit.
Nemanja Trifunovic
@Crashworks: Not that I don't believe you, but maybe you should ask a question as to why that is (just as this question asked why vectors are slower and was proved wrong). I don't know if you are right or wrong, not having done the tests, but all the articles I have read indicate that the difference from using exceptions is not measurable.
Martin York
Can't find it right now, but somewhere on SO there's a question about overheads for exception handling when no exceptions are thrown where I argued that there must be nonzero overhead, since disassemblies show MSVC++ adds some unless it can prove no exceptions can occur. But it turned out that g++'s overhead genuinely was zero for the examples I tried -- quite an engineering feat I think! (Zero execution time overhead that is.)
j_random_hacker
I don't think that `for(...) { pixels[i] = Pixel(); }` is quite the same as `pixels.resize(...)`. The former makes N default constructor calls and N assignment operator calls, and the latter makes 1 default constructor call and N copy constructor calls. Not that it really matters from a functional standpoint, considering that both versions initialize the pixels to uninitialized garbage. But it does point out a difference from `new Pixel[...]`, which calls the default constructor N times and never uses the copy constructor or assignment operator to copy uninitialized garbage around.
bk1e
@bk1e: Interesting, I didn't realise `resize()` took a (defaultable) 2nd parameter! I expect they'll all compile to almost identical machine code if optimisations are turned on, because of the "as-if" rule, but yes, they're technically distinct operations.
j_random_hacker
+2  A: 

The vector ones are additionally calling Pixel constructors.

Each is causing almost a million ctor runs that you're timing.

edit: then there's the outer 1...1000 loop, so make that a billion ctor calls!

edit 2: it'd be interesting to see the disassembly for the UseArray case. An optimizer could optimize the whole thing away, since it has no effect other than burning CPU.

Graham Perks
You're right, but the question is: how can these pointless ctor calls be turned off? It's easy for the non-STL approach, but difficult/ugly for the STL way.
j_random_hacker
A: 

Well, that's because vector::resize() does much more processing than plain memory allocation (by malloc). Try putting a breakpoint in your copy constructor (define one so that you can set a breakpoint!) and there goes the additional processing time.

YeenFei
+10  A: 

Great question. I came in here expecting to find some simple fix that would speed the vector tests right up. That didn't work out quite like I expected!

Optimization helps, but it's not enough. With optimization on I'm still seeing a 2X performance difference between UseArray and UseVector. Interestingly, UseVector was significantly slower than UseVectorPushBack without optimization.

# g++ -Wall -Wextra -pedantic -o vector vector.cpp
# ./vector
UseArray completed in 20.68 seconds
UseVector completed in 120.509 seconds
UseVectorPushBack completed in 37.654 seconds
The whole thing completed in 178.845 seconds
# g++ -Wall -Wextra -pedantic -O3 -o vector vector.cpp
# ./vector
UseArray completed in 3.09 seconds
UseVector completed in 6.09 seconds
UseVectorPushBack completed in 9.847 seconds
The whole thing completed in 19.028 seconds

Idea #1 - Use new[] instead of malloc

I tried changing malloc() to new[] in UseArray so the objects would get constructed. And changing from individual field assignment to assigning a Pixel instance. Oh, and renaming the inner loop variable to j.

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {   
        int dimension = 999;

        // Same speed as malloc().
        Pixel * pixels = new Pixel[dimension * dimension];

        for(int j = 0 ; j < dimension * dimension; ++j)
            pixels[j] = Pixel(255, 0, 0);

        delete[] pixels;
    }
}

Surprisingly (to me), none of those changes made any difference whatsoever. Not even the change to new[] which will default construct all of the Pixels. It seems that gcc can optimize out the default constructor calls when using new[], but not when using vector.

Idea #2 - Remove repeated operator[] calls

I also attempted to get rid of the triple operator[] lookup and cache the reference to pixels[j]. That actually slowed UseVector down! Oops.

for(int j = 0; j < dimension * dimension; ++j)
{
    // Slower than accessing pixels[j] three times.
    Pixel &pixel = pixels[j];
    pixel.r = 255;
    pixel.g = 0;
    pixel.b = 0;
}

# ./vector 
UseArray completed in 3.226 seconds
UseVector completed in 7.54 seconds
UseVectorPushBack completed in 9.859 seconds
The whole thing completed in 20.626 seconds

Idea #3 - Remove constructors

What about removing the constructors entirely? Then perhaps gcc can optimize out the construction of all of the objects when the vectors are created. What happens if we change Pixel to:

struct Pixel
{
    unsigned char r, g, b;
};

Result: about 10% faster. Still slower than an array. Hm.

# ./vector 
UseArray completed in 3.239 seconds
UseVector completed in 5.567 seconds

Idea #4 - Use iterator instead of loop index

How about using a vector<Pixel>::iterator instead of a loop index?

for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j)
{
    j->r = 255;
    j->g = 0;
    j->b = 0;
}

Result:

# ./vector 
UseArray completed in 3.264 seconds
UseVector completed in 5.443 seconds

Nope, no different. At least it's not slower. I thought this would have performance similar to #2 where I used a Pixel& reference.

Conclusion

Even if some smart cookie figures out how to make the vector loop as fast as the array one, this does not speak well of the default behavior of std::vector. So much for the compiler being smart enough to optimize out all the C++ness and make STL containers as fast as raw arrays.

The bottom line is that the compiler is unable to optimize away the no-op default constructor calls when using std::vector. If you use plain new[] it optimizes them away just fine. But not with std::vector. Even if you can rewrite your code to eliminate the constructor calls, that flies in the face of the mantra around here: "The compiler is smarter than you. The STL is just as fast as plain C. Don't worry about it."

John Kugelman
Again, thanks for actually running the code. It's sometimes easy to get bashed without reasons when someone attempts to challenge popular opinions.
kizzx2
"So much for the compiler being smart enough to optimize out all the C++ness and make STL containers as fast as raw arrays." Nice comments. I have a theory that this "compiler is smart" is just a myth -- C++ parsing is extremely hard and the compiler is just a machine.
kizzx2
You can see @Martin's answer. I think my test is a little synthesized and stressed the initialization part. For long-living arrays that aren't performance surgery-kind critical, `vector` probably justify it for the extra benefits it brings.
kizzx2
I dunno. Sure, he was able to *slow down* the array test, but he didn't *speed up* the vector one. I edited in above where I removed the constructors from Pixel and made it a simple struct, and it was still slow. That's bad news for anyone who uses simple types like `vector<int>`.
John Kugelman
I wish I really could upvote your answer twice. Smart ideas to try out (even though none really worked) that I couldn't even think of!
kizzx2
As for idea #2, it's not surprising. I did some hard-core optimization before and for primitive types like `unsigned char`, it's faster for the CPU and the optimizer to juggle that in the register than in memory. If you used a reference, sometimes things are forced to go through a memory address.
kizzx2
I don't think you actually removed the constructor in your test. Doesn't the compiler generated default constructor do a default construction on each member? You'd have to define a default constructor that explicitly does nothing to make a valid test.
Mark Ransom
@Mark kizzx2's original code has a do-nothing default constructor. When I removed it there was a 10% speedup. The compiler generated default constructor will default construct each member, yes, but default constructing `unsigned char`s leaves them uninitialized.
John Kugelman
@John Kugelman: for idea #2, did you try a `const` reference? That may give the compiler the hint it needs to prevent unnecessary memory fetches.
Evan Teran
Just wanted to make a note that complexity of _parsing_ C++ (which is insanely complex, yes) has nothing to do with quality of optimization. The latter usually happens on stages where parse result is already transformed numerous times to a much more low-level representation.
Pavel Minaev
@Evan: A const ref won't work, since we need to assign to the object's members.
j_random_hacker
@j_random_hacker: ah, you are right, i didn't read the example code carefully enough.
Evan Teran
+1  A: 

Some profiler data (Pixel is aligned to 32 bits):

g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds

blah

andrey@nv:~$ opannotate --source libcchem/src/a.out  | grep "Total samples for file" -A3
Overflow stats not available
 * Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
 *
 * 141008 52.5367
 */
--
 * Total samples for file : "/home/andrey/libcchem/src/test.cpp"
 *
 *  61556 22.9345
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
 *
 *  41956 15.6320
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
 *
 *  20956  7.8078
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
 *
 *   2923  1.0891
 */

in allocator:

               :      // _GLIBCXX_RESOLVE_LIB_DEFECTS
               :      // 402. wrong new expression in [some_] allocator::construct
               :      void
               :      construct(pointer __p, const _Tp& __val)
141008 52.5367 :      { ::new((void *)__p) _Tp(__val); }

vector

               :void UseVector()
               :{ /* UseVector() total:  60121 22.3999 */
...
               :
               :
 10790  4.0201 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
   495  0.1844 :            pixels[i].r = 255;
               :
 12618  4.7012 :            pixels[i].g = 0;
               :
  2253  0.8394 :            pixels[i].b = 0;
               :
               :        }

array

               :void UseArray()
               :{ /* UseArray() total:  35191 13.1114 */
               :
...
               :
   136  0.0507 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
  9897  3.6874 :            pixels[i].r = 255;
               :
  3511  1.3081 :            pixels[i].g = 0;
               :
 21647  8.0652 :            pixels[i].b = 0;

Most of the overhead is in the copy constructor. For example:

std::vector<Pixel> pixels; //(dimension * dimension, Pixel());
pixels.reserve(dimension * dimension);

// Caveat: indexing past size() after reserve() is undefined behavior;
// it only appears to work here because Pixel is trivially constructible.
for (int i = 0; i < dimension * dimension; ++i) {
    pixels[i].r = 255;
    pixels[i].g = 0;
    pixels[i].b = 0;
}

This has the same performance as the array.

aaa
+2  A: 

Martin York's answer bothers me because it seems like an attempt to brush the initialisation problem under the carpet. But he is right to identify redundant default construction as the source of performance problems.

[EDIT: Martin's answer no longer suggests changing the default constructor.]

For the immediate problem at hand, you could certainly call the 2-parameter version of the vector<Pixel> ctor instead:

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

That works if you want to initialise with a constant value, which is a common case. But the more general problem is: How can you efficiently initialise with something more complicated than a constant value?

For this you can use a back_insert_iterator, which is an iterator adaptor. Here's an example with a vector of ints, although the general idea works just as well for Pixels:

#include <iterator>
// Simple functor returning the list of squares: 1, 4, 9, 16...
struct squares {
    squares() : i(0) {}
    int operator()() { ++i; return i * i; }  // must not be const: it mutates i

private:
    int i;
};

...

std::vector<int> v;
v.reserve(someSize);     // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());

Alternatively you could use copy() or transform() instead of generate_n().

The downside is that the logic to construct the initial values needs to be moved into a separate class, which is less convenient than having it in-place (although lambdas in C++1x will make this much nicer). Also I expect this will still not be as fast as a malloc()-based non-STL version, but I expect it will be close, since it only does one construction for each element.

j_random_hacker
+2  A: 

To be fair, you cannot compare a C++ implementation to a C implementation, as I would call your malloc version. malloc does not create objects - it only allocates raw memory. That you then treat that memory as objects without calling the constructor is poor C++ (possibly invalid - I'll leave that to the language lawyers).

That said, simply changing the malloc to new Pixel[dimensions*dimensions] and free to delete [] pixels does not make much difference with the simple implementation of Pixel that you have. Here's the results on my box (E6600, 64-bit):

UseArray completed in 0.269 seconds
UseVector completed in 1.665 seconds
UseVectorPushBack completed in 7.309 seconds
The whole thing completed in 9.244 seconds

But with a slight change, the tables turn:

Pixel.h

struct Pixel
{
    Pixel();
    Pixel(unsigned char r, unsigned char g, unsigned char b);

    unsigned char r, g, b;
};

Pixel.cc

#include "Pixel.h"

Pixel::Pixel() {}
Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b) 
  : r(r), g(g), b(b) {}

main.cc

#include "Pixel.h"
[rest of test harness without class Pixel]
[UseArray now uses new/delete not malloc/free]

Compiled this way:

$ g++ -O3 -c -o Pixel.o Pixel.cc
$ g++ -O3 -c -o main.o main.cc
$ g++ -o main main.o Pixel.o

we get very different results:

UseArray completed in 2.78 seconds
UseVector completed in 1.651 seconds
UseVectorPushBack completed in 7.826 seconds
The whole thing completed in 12.258 seconds

With a non-inlined constructor for Pixel, std::vector now beats a raw array.

It would appear that the complexity of allocation through std::vector and std::allocator is too much to be optimised as effectively as a simple new Pixel[n]. However, we can see that the problem lies in the allocation, not the vector access, by tweaking a couple of the test functions to create the vector/array once, moving it outside the loop:

void UseVector()
{
    TestTimer t("UseVector");

    int dimension = 999;
    std::vector<Pixel> pixels;
    pixels.resize(dimension * dimension);

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

and

void UseArray()
{
    TestTimer t("UseArray");

    int dimension = 999;
    Pixel * pixels = new Pixel[dimension * dimension];

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
    delete [] pixels;
}

We get these results now:

UseArray completed in 0.254 seconds
UseVector completed in 0.249 seconds
UseVectorPushBack completed in 7.298 seconds
The whole thing completed in 7.802 seconds

What we can learn from this is that std::vector is comparable to a raw array for access, but if you need to create and delete the vector/array many times, creating a complex object will be more time-consuming than creating a simple array when the element's constructor is not inlined. I don't think that this is very surprising.

camh
+5  A: 

Try with this:

void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

I get almost exactly the same performance as with array.

The thing about vector is that it's a much more general tool than an array, and that means you have to consider how you use it. It can be used in a lot of different ways, providing functionality that an array doesn't even have. And if you use it "wrong" for your purpose, you incur a lot of overhead, but if you use it correctly, it is usually essentially a zero-overhead data structure.

In this case, the problem is that you separately initialized the vector (causing all elements to have their default ctor called), and then overwrote each element individually with the correct value. That is much harder for the compiler to optimize away than when you do the same thing with an array. Which is why the vector provides a constructor that lets you do exactly that: initialize N elements with value X.

And when you use that, the vector is just as fast as an array.

So no, you haven't busted the performance myth. But you have shown that it's only true if you use the vector optimally, which is a pretty good point too. :)

On the bright side, it's really the simplest usage that turns out to be fastest. If you contrast my code snippet (a single line) with John Kugelman's answer, containing heaps and heaps of tweaks and optimizations, which still don't quite eliminate the performance difference, it's pretty clear that vector is pretty cleverly designed after all. You don't have to jump through hoops to get speed equal to an array. On the contrary, you have to use the simplest possible solution.

jalf
I still question whether this is a fair comparison. If you're getting rid of the inner loop then the array equivalent would be to construct a single Pixel object and then blit that across the entire array.
John Kugelman
Using `new[]` performs the same default constructions that `vector.resize()` does, yet it is much faster. `new[]` + inner loop should be the same speed as `vector.resize()` + inner loop, but it's not, it's nearly twice as fast.
John Kugelman
@John: It is a fair comparison. In the original code, the array is allocated with `malloc` which doesn't initialize or construct anything, so it *is* effectively a single-pass algorithm just like my `vector` sample. And as for `new[]` the answer is obviously that both require two passes, but in the `new[]` case, the compiler is able to optimize that additional overhead away, which it doesn't do in the `vector` case. But I don't see why it is interesting what happens in suboptimal cases. If you care about performance, you don't write code like that.
jalf
@John: Interesting comment. If I wanted to blit across the entire array, I guess an array is again the optimal solution -- since I can't tell `vector::resize()` to give me a contiguous chunk of memory without it wasting time calling useless constructors.
kizzx2
@kizzx2: yes and no. An array is normally initialized as well in C++. In C, you'd use `malloc` which doesn't perform initialization, but that won't work in C++ with non-POD types. So in the general case, a C++ array would be just as bad. Perhaps the question is, if you're going to perform this blitting often, wouldn't you reuse the same array/vector? And if you do that, then you only pay the cost of "useless constructors" once, at the very start. The actual blitting is just as fast after all.
jalf
But it is a good example. Like I said in my post, the standard library tries to be general. If you only need to store POD types **and** you need to perform something like blitting **and** for some reason you recreate the array each time **and** the size of the array is fixed, **then** a raw array allocated with `malloc` might be a better solution. (But avoiding the reallocation entirely would be better still, and would pretty much negate the performance penalty of the `vector`)
jalf
@jalf: What did you mean by "that won't work in C++ with non-POD types"? Do you mean structs with pointers in them? For me, though, I don't really see a reason why `vector` can't give me the option to just turn off the constructor if I don't want it. The decision was made so that it stores things contiguously and `operator[]` works like an array. Wonder why they left out this last bit :/
kizzx2
A: 

With the right options, vectors and arrays can generate identical asm. In these cases, they are of course the same speed, because you get the same executable file either way.

Roger Pate
In this case, they do not seem to generate the same assembly. In particular, there seems to be no way to suppress the calls to constructors using vectors. You can refer to the answers here for that problem (it's a long read but it does explain why there is a performance difference in cases other than the simple test case in the link you provided). (Actually, there seems to be a way -- writing a custom STL allocator, as suggested. Personally, I find it unnecessarily more work than using malloc.)
kizzx2
@kizzx2: That you have to go to such lengths to use *unconstructed* objects is a good thing, because that's an error 99% (I may be grossly underestimating) of the time. I did read the other answers, and I realize I don't address your specific situation (no need, the other answers are correct), but I still wanted to provide you with this example of how vectors and arrays can behave exactly the same.
Roger Pate
@Roger: that's great! Thanks for the link
kizzx2