views:

492

answers:

9

Hi, this question is just speculative.

I have the following implementation in C++:

using namespace std;

void testvector(int x)
{
  vector<string> v;
  char aux[20];
  int a = x * 2000;
  int z = a + 2000;
  string s("X-");
  for (int i = a; i < z; i++)
  {
    sprintf(aux, "%d", i);
    v.push_back(s + aux);
  }
}

int main()
{
  for (int i = 0; i < 10000; i++)
  {
    if (i % 1000 == 0) cout << i << endl;
    testvector(i);
  }
}

In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs lot faster than my C++ application (approx. 2 seconds).

I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...

So, what do you think that should be modified in the program above or, I dunno, in the libraries used or in the memory allocator to reach similar performances in this stuff? (writing actual code of these things can be very long, so, discussing about it would be great)...

Thank you.

+1  A: 

It should help if you use Vector::reserve to reserve space for z elements in v before the loop (however the same thing should also speed up the java equivalent of this code).

sepp2k
+10  A: 

You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.

However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.

In C++'s default allocator, every block is treated the same, and so are averagely expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation) and if they turn out to be short-lived then they are also very cheap to clean up.

This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.

Daniel Earwicker
And by 'hand-tune out of all recognition' you mean use a custom allocator that allocates from a memory pool?
gnud
Like the ready to go boost pool allocator? It's almost a one line change to get the same performance.
GMan
@gnud - try the comparison for yourself. Dropping in a custom pool allocator will make only a disappointing difference on most implementations, as they already implement malloc by sub-allocating from pools. Modern GCs and JIT languages can take advantage of information about the stack layout of function calls to identify unreachable objects, they can move objects to eradicate fragmentation, they can delay expensive work until after periods of heavy workload. By working at a higher level, they have more ways of optimising. Pointer-based memory manipulation is comparatively harder to optimise.
Daniel Earwicker
A, theoretical, smart enough compiler would realize that the vector is thrown away after it's filled, and optimize testvector() away completely. What I'm getting at, such contrived do-nothing test programs are, literally, meaningless.
Pieter
I can’t agree with your conclusion. There is nothing mythical about C++’s performance. Numerous performance tests demonstrate it time and again. There are a **few** notable niches where C++ has worse performance than other languages/engines. These are most notably strings processing and l12n / Unicode support. These have been well-publicized but they are extremely rare exceptions to an over-arching rule. And that rule still is: for most computation-intense algorithms, C++ is up to an order of magnitude faster than managed languages.
Konrad Rudolph
@earwicker - i managed to cut the c++ version running time to 1/3 by using the boost allocator for the strings, and not re-allocating the vector between every call to `testvector`. See my answer below. Now, I don't claim that c++ is faster than jvm/cli at allocating a new object. But I claim that his way of going about allocating 2000 objects in c++, is pretty silly.
gnud
My C# port (using List<string>) takes 8.7s to run. I think Java just eliminates most of the dead code in testvector. The original C++ version that I fixed to compile took 27s. I easily brought it down to ~6s by calling vector.reserve, eliminating the redundant std::string and using a simpler int->char* function because sprintf was very slow (9s just the sprintf).
rpg
@Konrad - I think you've misunderstood me. Computation intensive algorithms, yes, C++ is iikely to be faster. (But an order of magnitude!? That's fiction. You are aware that managed languages are JITed to native machine code and use inlining?) But here I'm talking about the 90% of modern applications that do lots of small heap object allocations. In that domain, managed environments just do a much better job. C++'s memory model fundamentally limits the extent to which it can compete with the self-optimising power of a generational compacting GC.
Daniel Earwicker
@Earwicker: You seem to have a few misconceptions about managed languages: a) GC is not free, you pay for the allocation speed in memory use, which translates to a speed loss when you take caching into consideration. b) JITing has to be very fast, so it can't optimize as well as a regular compiler. You can convince yourself by doing a port of the C++ code in C#. Both Konrad's Java version and my C# version were outrun by non-naive implementations of the original C++ code...
rpg
@rpg - Again, you aren't hearing what I'm saying. I repeat, up front, C++ will always be *able* to do computation faster than managed languages, if we employ our hard-won understanding of how to use it well for speed. But what of elegant, simple, readable code, which focuses on design concerns and correctness instead of raw speed? That is what most implementors have to prioritise. If they do that and compare the results in standard C++ versus a managed language, the managed version will perform better. And every iota of effort devoted to performance tuning is taken away from other concerns.
Daniel Earwicker
By the way, although GC is not free, neither is the C++ free store. Both add some overhead to each allocated block. In the 32-bit CLR it's 8 bytes. Alexandrescu's chapter on allocators says the typical overhead in C++ is between 4 and 32 bytes. So CLR's hit is toward the low end of the range. And `new`/`delete` are hamstrung by the interface they expose (pointers) so they can't compact the heap, forcing them to do some things in a slower way, and leaving them susceptible to fragmentation.
Daniel Earwicker
One of the arguments in *favour* of a compacting heap is the greater locality and hence better use of CPU cache. See Richter, *CLR via C#*.
Daniel Earwicker
@Earwicker: Don’t underestimate the impact of cache misses on performance. Fetching from RAM is several orders of magnitude slower than loading from cache. Good C++ code can cut down on the number of allocations drastically which enables caching to be used much more efficiently. As Stroustrup has said, “C++ Is my favorite garbage collected language because it generates so little garbage.”
Konrad Rudolph
This whole discussion reminds me of http://stackoverflow.com/questions/918359/my-python-program-executes-faster-than-my-java-version-of-the-same-program-what - different languages have different strenghts and weaknesses, so small, useless examples can produce surprising results to those who don't know much about the internals of each language/vm.
gnud
@Konrad - I `don't` under estimate it - in the comment immediately above I refer to it as a concern that is well served by a compacting GC. And once again, I don't disagree about the power of *good* C++, which is a phrase that implicitly conjours up an expert and an unlimited amount of development time to devote to performance work. I'm talking every day situations where you'd prefer to focus your efforts on correctness, readability and other important values. *By default*, managed environments do a good job with memory. *By default* C++ does not (C++0x with r-value references will be better).
Daniel Earwicker
Jeez, why is it so hard to accept that languages other than C++ can perform well? The man is right, dynamic allocations in C++ are horrendously expensive. Yes, a pool allocator helps, but it doesn't solve the issue entirely. It has fixed capacity, and it doesn't offer the same compaction over time which improves locality. It requires more expensive strategies for freeing memory (such as reference counting).
jalf
Intense loyalty always arises with programming languages, because people have to concentrate hard on a language in order to become experts at it - and never more so than with C++! Unfortunately this limits awareness of how "my" language compares with those "other" languages.
Daniel Earwicker
Languages other than C++ can perform well (especially C# and Java). I was pleasantly surprised with C# on WinMo... However, I doubt that C# and Java are often faster than C++, because that's not at all the impression that I have from customers, benchmarks, my own programming experience and my experience as a user. It doesn't please me that in theory Java can JIT my code to be 25% faster than C++ when I know that have to wait for Eclipse to load or have to wait to create a new document in Visual Paradigm.
rpg
@Earwicker: You left out the memory used by the collector thread and the GC data structures + extra information for .NET classes (reflection, etc). These are non-local accesses that matter. I was looking at some Debian benchmark where Java beat C++ by 2x at speed, but Java's memory use was 10x that of C++! There are always tradeoffs to be made... I agree that over a long period of time, the GC compact routine is bound to have a positive effect on memory use, and the C++ app is likely to have a fragmented heap (poor malloc implementation, naive memory use, etc).
rpg
A: 

What are you actually trying to measure here? Putting ints into a vector?

You can start by pre-allocating space into the vector with the know size of the vector:

instead of:

void testvector(int x)
{
  vector<string> v;
  int a = x * 2000;
  int z = a + 2000;
  string s("X-");
  for (int i = a; i < z; i++)
    v.push_back(i);
}

try:

void testvector(int x)
{
  int a = x * 2000;
  int z = a + 2000;
  string s("X-");
  vector<string> v(z);
  for (int i = a; i < z; i++)
    v.push_back(i);
}
Edison Gustavo Muenz
A: 

In your inner loop, you are pushing ints into a string vector. If you just single-step that at the machine-code level, I'll bet you find that a lot of that time goes into allocating and formatting the strings, and then some time goes into the pushback (not to mention deallocation when you release the vector).

This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.

Mike Dunlavey
Sorry, that was an error writing the program here.It is fixed right now.
ebasconp
@ebascomp: Right, but now you have s+i in the inner loop. It looks innocent, but single-step through it at the disassembly level, and you'll see it's doing a lot. Another way to tell is to just pause it under the debugger. My guess is >90% of the time it is in the process of executing `s+i`.
Mike Dunlavey
One you've eliminated the unnecessary string addition, about half the time is spent in sprintf() and half constructing the std::string. Java may well be faster at the second, since std::string is not very efficient. I'm dubious that the cost of sprintf() can be reduced though.
alex tingle
@alex: I agree. If you want to do a whole lot of sprintfs, of numbers, it's going to take time, plain and simple. And making a lot of small strings is not going to be cheap either.
Mike Dunlavey
+2  A: 

To suggest why the performance both C++ and java differ it would essential to see source for both, I can see a number of performance issues in the C++, for some it would be useful to see if you were doing the same in the java (e.g. flushing the output stream via std::endl, do you call System.out.flush() or just append a '\n', if the later then you've just given the java a distinct advantage)?

vickirk
It would be interesting to see the Java code, but I doubt calling flush unnecessarily 10 times makes any difference to the performance.
UncleBens
That was only one example, but flushing the output can have a huge impact, flushing forces data to be written which usually means blocking, just writing on a (buffered stream) would come straight back.
vickirk
+3  A: 

Hotspot optimises hot spots in code. Typically, anything that gets executed 10000 times it tries to optimise.

For this code, after 5 iterations it will try and optimise the inner loop adding the strings to the vector. The optimisation it will do more than likely will include escape analyi o the variables in the method. A the vector is a local variable and never escapes local context, it is very likely that it will remove all of the code in the method and turn it into a no op. To test this, try returning the results from the method. Even then, be careful to do something meaningful with the result - just getting it's length for example can be optimised as horpsot can see the result is alway the same a s the number of iterations in the loop.

All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.

Robert Christie
+6  A: 

All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.

You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!

alex tingle
Like "benchmarks" that conclude that Java can complete infinity in zero seconds because the `while(true)` and its body get eliminated because they do nothing :P
JulianR
+2  A: 

In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs lot faster than my C++ application (approx. 2 seconds).

I cannot reproduce that result.

To account for the optimization mentioned by Alex, I’ve modified the codes so that both the Java and the C++ code printed the last result of the v vector at the end of the testvector method.

Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward, uses ArrayList instead of Vector although I doubt that this would impact the performance, thanks to escape analysis) takes about twice that time.

I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.

Just for the record, the tests were run on the following configuration:

$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
Konrad Rudolph
Which version of Java did you use on which operating system?
Jesper
@Jesper: I used Sun JVM 6.0 on MacOSX in a MacBook and in Kubuntu 9.04 in an amd64 box.
ebasconp
@Jesper: Good comment, see update.
Konrad Rudolph
+4  A: 

Well, this is a pretty useless test that only measures allocation of small objects. That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:

typedef vector<string, boost::pool_allocator<string> > str_vector;    

void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
    char aux[25] = "X-";
    int a = x * 2000;
    for (; it != end; ++a)
    {
     sprintf(aux+2, "%d", a);
     *it++ = aux;
    }
}

int main(int argc, char** argv)
{
    str_vector v(2000);
    for (int i = 0; i < 10000; i++)
    {
     if (i % 1000 == 0) cout << i << endl;
     testvector(i, v.begin(), v.begin()+2000);
    }
    return 0;
}

real    0m4.089s
user    0m3.686s
sys     0m0.000s

Java version has the times:

real    0m2.923s
user    0m2.490s
sys     0m0.063s

(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).

So, to sum up, small allocations are faster on java, and memory management is a bit more hassle in C++. But we knew that already :)

gnud
You could also try to get rid of the sprintf, it was a massive time sink in my tests.
rpg
I've removed the silly string object used in the original, and use one char buffer on the stack. I also only overwrite the number each time, not the prefix. I can't imagine sprintf is that much worse than a sstream, if at all. The problem was the original implementation of the string concatenation.
gnud
At one point, I entirely removed the vector and the std::string and I couldn't believe how slow sprintf was. Maybe this is specific to MSVC 9.
rpg
Well, the moment I touch a stringstream, the time goes up by about a second...
gnud