views:

540

answers:

5

Hi, I have been working on a project and trying to find the source of a large slowdown in execution time and have narrowed it down to a single method which I have managed to optimise out of the logic. The problem is that my solution involves using a reference which makes another section of the code run quite slowly... The question I'd like answered is why the inner loop takes so much longer to evaluate when the map is a reference as opposed to a local variable?

Here's the old way prior to optimisation:

// old method: create an empty map, populate it
// and then assign it back to the path object later
map<int,float> screenline_usage; 

for (int i=0; i<numCandidates; ++i)
{
  // timing starts here. 
  map<int, float>& my_screenline_usage =
    path->get_combined_screenline_usage(legnr, stop_id);
  map<int, float>::iterator it = my_screenline_usage.begin(); 
  for (; it != my_screenline_usage.end(); ++it) 
    screenline_usage[it->first] += usage * it->second; 
  // timing ends here, this block evaluated 4 million times for overall execution time of ~12 seconds
}

// This function call is evaluated 400k times for an overall execution time of ~126 seconds
path->set_zone_screenline_usage(access_mode, zone_id, screenline_usage); 

// TOTAL EXECUTION TIME: 138 seconds.

New way after optimisation:

// new method: get a reference to internal path mapping and populate it
map<int, float>& screenline_usage =
  path->get_zone_screenline_usage(access_mode, zone_id);
screenline_usage.clear();

for (int i=0; i<numCandidates; ++i)
{
  // timing starts here
  map<int, float>& my_screenline_usage =
    path->get_combined_screenline_usage(legnr, stop_id);
  map<int, float>::iterator it = my_screenline_usage.begin(); 
  for (; it != my_screenline_usage.end(); ++it) 
    screenline_usage[it->first] += usage * it->second; 
  // timing ends here, this block evaluated 4 million times for overall execution time of ~76 seconds
}

// New method... no need to assign back to path object (0 seconds execution :)
// TOTAL EXECUTION TIME: 76 seconds (62 second time saving) ... but should be able to do it in just 12 seconds if the use of reference didn't add so much time :(

Here are the pertinent subroutines called from that code:

// This is the really slow routine, due to the copy assignment used. 
void set_zone_screenline_usage(int access_mode, int zone_id,
  map<int,float>& screenline_usage)
{
  m_container[access_mode][zone_id] = screenline_usage; 
}

map<int,float>& get_zone_screenline_usage(int access_mode, int zone_id)
{
  return m_container[access_mode][zone_id]; 
}

NOTES: Timing information is for a single run in which the above code is evaluated approximately 400k times. The timing is done using some classes that I built to access the RDTSC time stamp counter (yes i know TSC means time stamp counter), the average value of numCandidates is 10 and the average number of elements put into the screenline_usage map is 25.


UPDATE: Firstly thanks to everyone who has gotten involved here. I think that in the end this had nothing to do c++ references at all and had more to do with cache consistency. I have replaced the optimised code above with a vector& and a hash function implemented as a member variable map

// newest method: get a reference to internal path mapping (as vector) and populate it 
// map<int,int> m_linkNum_to_SlNum declared in header and populated in constructor. 
vector<float>& screenline_usage = path->get_zone_screenline_usage(access_mode, zone_id);

for (int i=0; i<numCandidates; ++i)
{
  // timing starts here
  map<int, float>& my_screenline_usage =
    path->get_combined_screenline_usage(legnr, stop_id);
  map<int, float>::iterator it = my_screenline_usage.begin(); 
  for (; it != my_screenline_usage.end(); ++it) 
    screenline_usage[m_linkNum_to_SlNum[it->first]] += usage * it->second; 
  // timing ends here, this block evaluated 4 million times for overall execution time of ~9 seconds
}

// Newest method... again no need to assign back to path object (0 seconds execution :)
// TOTAL EXECUTION TIME: just 9 seconds (129 second time saving) ... this is even better than using a locally constructed map which took 12 seconds in the inner loop :)

It seems to me here that, given that the vector isn't local but is a contiguous block of memory and that the hashing function (m_linkNum_to_SlNum) is a local member variable, this approach leads to code/data that is able to fit into cache without having to go out to main memory for data resulting in the significant speed up. Other conclusions given these findings are greatly appreciated.

+4  A: 

Maybe you C++ compiler is able to inline some code for the local map, but not when the map is a reference.

Alex Martelli
If that's the case then you can see it by compiling with /FAcs (MSVC) or -c -g -Wa,-a,-ad (GCC).
Crashworks
This seems the most plausible solution however when I tried out the /FAcs flag I'm not able to tell conclusively (mainly because I'm not an assembler guru). I've identified the line in question and there were some small differences in the generated asm in terms of offsets (ie "mov eax, DWORD PTR [-208+ebp]" became -224+ebp for the reference) the only significant change was that "lea ecx, DWORD PTR [-336+ebp]" in the local variable version became "map mov ecx, DWORD PTR [-124+ebp]" in the reference version
Jamie Cook
A: 

The problem I'm trying to figure out is why the inner loop takes so much longer to evaluate when the map is a reference?

I guess that what's taking a long time is the call to get_zone_screenline_usage: and that the reason why that's taking a long time is that the specific element doesn't already exist in m_container and must therefore be created and inserted before it can be returned.

ChrisW
+3  A: 

Your question is unclear. Do you mean to ask, why is passing a map by reference faster than passing by value? If so, the answer is simple: returning a map by value means copying it in its entirety, and copying a large map is a very expensive operation.

On the other hand, if your question is: why is faster to get a reference to an already existing map and populate that than it is to make a new map, then one hypothesis is that it has to do with

 screenline_usage[it->first] += usage * it->second;

Since key [it->first] already exists in the map inside path->get_zone_screenline_usage, then this is is simply an update operation and requires no memory allocation. However, if screenline_usage is an empty map, then each access to a new [it->first] means that it first has to allocate a new map node from the heap, and that's expensive.

Crashworks
I'll go with this one :)
Magnus Skog
Actually the empty local variable map is significantly faster to evaluate than the reference and it was this discrepency that I'm trying to understand. I've updated the question to reflect this.
Jamie Cook
Well, rather than make any more guesses, my best suggestion would be to hook up a sampling profiler and see where a tight loop on the slow algorithm spends its time.
Crashworks
+2  A: 

If I understand your question correctly, you imply that inserting to a local stack-allocated map<> is much faster than inserting to an existing heap-allocated map<> that you retrieve by reference.

There are several possibilities that may be affecting performance. I doubt it has anything to do with C++ reference performance, though.

The first possibility, is that by changing screenline_usage to a reference, you are "essentially" retrieving a pointer to an existing object. The actual instance of the object may not be map<int,float>, it could by anything that inherits from map. For instance, it could be a map with a custom comparator function defined for it. Or a subclass with thread-syncronization logic. Since you don't know what type you are getting from the call to m_container[access_mode][zone_id], you may very well be getting a subclass that does not perform well on insert. (You can see what type you get back in the debugger, by the way.) Whereas, by creating an empty map<int,float> you have a guarantee that the type is an actual map<>, and not a subclass.

Let's assume that you are getting an actual map<> instance back.

The second possibility, is that in the particular flavor of STL that you are using, the map::clear() function does not efficiently clear out internal data structures used to maintain the associative indexes. As I recall, stl::map<> uses some complex internal hashing and bucket structures to provide efficient insert and retrieval capabilities. However, it's unclear what happens to those when you call clear(). So it's possible that screenline_usage.clear() results in a less efficient map to insert against than if you started with an empty map.

Finally, let's assume that I'm wrong, and it's neither of these possibilities. There is a simple way to determine whether the difference is the result of using a reference variable. You can try allocating a new map on the heap and assigning it to a reference in your code, like this:

map<int, float>& screenline_usage = new map<int,float>();

This will help you determine if there is some difference in inserting to an existing map vs. a new map, or if it is indeed the fact that screenline_usage is a reference that is affecting performance. BTW, Don't forget to free this heap-allocated map, or you'll end up with a memory leak.

LBushkin
ChrisW
You're right. But I want to provide an example that could evaluate the performance difference of references (if any) in the OP question.
LBushkin
Just to correct some wrong info here: stl::map is an ordered container, and is usually implemented as red-black tree. What you are describing above is a hashtable, or unordered_map in TR1.
Nikolai N Fetissov
I don't think that your 'first possibility' logic is valid. std::map is not an interface and none of its functions are virtual. The comparator is part of its type (template parameter) and can't be influenced by inheritance. A reference to a std::map could be a reference to a base class sub-object of a larger type but this is irrelevant to performance as the data structure and methods invoked on the object will be identical in either case as there can be no polymorphic behaviour through the reference.
Charles Bailey
+1  A: 

References are most often behind the scenes implemented as pointers. An exception to this would be if a temporary is assigned to them , then the lifetime of the value is extended to the lifetime of the reference; in essence, the reference is the object. So, it depends if:

get_combined_screenline_usage

returns by-reference or by-value. If it is by reference, the reference way might be faster. If it is by value, then the old way and new way are doing essentially the same thing, assuming the compiler does a return value optimization.

In either case, other optimizations the compiler does (for instance inlining) might mask the effects of the two choices; effectively, without knowing which exact line your slow part is, it makes this a bit of a guessing game.

I'd suggest trying to get finer grain information, and narrow it down to exactly the line that is taking so long, it makes optimizing a lot easier.

(as a note:

map<int, float>::iterator it = my_screenline_usage.begin(); 
for (; it != my_screenline_usage.end(); ++it)

might be more efficient if written as

for (map<int, float>::const_iterator it(my_screenline_usage.begin()), end(my_screenline_usage.begin()); it != end; ++it)

)

Todd Gardner