Hi, I have been working on a project and trying to track down the source of a large slowdown in execution time. I narrowed it down to a single method, which I managed to optimise out of the logic. The problem is that my solution involves using a reference, which makes another section of the code run quite slowly. The question I'd like answered is: why does the inner loop take so much longer to evaluate when the map is a reference rather than a local variable?
Here's the old way prior to optimisation:
// old method: create an empty map, populate it,
// and then assign it back to the path object later
map<int, float> screenline_usage;
for (int i = 0; i < numCandidates; ++i)
{
    // timing starts here
    map<int, float>& my_screenline_usage =
        path->get_combined_screenline_usage(legnr, stop_id);
    map<int, float>::iterator it = my_screenline_usage.begin();
    for (; it != my_screenline_usage.end(); ++it)
        screenline_usage[it->first] += usage * it->second;
    // timing ends here; this block is evaluated 4 million times
    // for an overall execution time of ~12 seconds
}
// This function call is evaluated 400k times for an overall execution time of ~126 seconds
path->set_zone_screenline_usage(access_mode, zone_id, screenline_usage);
// TOTAL EXECUTION TIME: 138 seconds.
Here's the new way after optimisation:
// new method: get a reference to the path object's internal map
// and populate it in place
map<int, float>& screenline_usage =
    path->get_zone_screenline_usage(access_mode, zone_id);
screenline_usage.clear();
for (int i = 0; i < numCandidates; ++i)
{
    // timing starts here
    map<int, float>& my_screenline_usage =
        path->get_combined_screenline_usage(legnr, stop_id);
    map<int, float>::iterator it = my_screenline_usage.begin();
    for (; it != my_screenline_usage.end(); ++it)
        screenline_usage[it->first] += usage * it->second;
    // timing ends here; this block is evaluated 4 million times
    // for an overall execution time of ~76 seconds
}
// New method... no need to assign back to path object (0 seconds execution :)
// TOTAL EXECUTION TIME: 76 seconds (62 second time saving) ... but should be able to do it in just 12 seconds if the use of reference didn't add so much time :(
Here are the pertinent subroutines called from that code:
// This is the really slow routine, due to the copy assignment used.
// (The map is only read here, so it should be taken by const reference.)
void set_zone_screenline_usage(int access_mode, int zone_id,
                               const map<int, float>& screenline_usage)
{
    m_container[access_mode][zone_id] = screenline_usage;
}

map<int, float>& get_zone_screenline_usage(int access_mode, int zone_id)
{
    return m_container[access_mode][zone_id];
}
NOTES: The timing information is for a single run in which the above code is evaluated approximately 400k times. The timing is done using some classes I built around the RDTSC instruction (the CPU's time stamp counter). The average value of numCandidates is 10, and the average number of elements put into the screenline_usage map is 25.
UPDATE: Firstly, thanks to everyone who has gotten involved here. I think that in the end this had nothing to do with C++ references at all and much more to do with cache locality. I have replaced the optimised code above with a vector<float>& and a lookup table (mapping link numbers to screenline numbers) implemented as a member-variable map:
// newest method: get a reference to the internal path mapping (as a vector)
// and populate it in place.
// map<int, int> m_linkNum_to_SlNum is declared in the header and populated
// in the constructor.
vector<float>& screenline_usage =
    path->get_zone_screenline_usage(access_mode, zone_id);
// NOTE: the vector must be pre-sized and reset to zero before this
// accumulation (the map version did this with screenline_usage.clear()).
for (int i = 0; i < numCandidates; ++i)
{
    // timing starts here
    map<int, float>& my_screenline_usage =
        path->get_combined_screenline_usage(legnr, stop_id);
    map<int, float>::iterator it = my_screenline_usage.begin();
    for (; it != my_screenline_usage.end(); ++it)
        screenline_usage[m_linkNum_to_SlNum[it->first]] += usage * it->second;
    // timing ends here; this block is evaluated 4 million times
    // for an overall execution time of ~9 seconds
}
// Newest method... again no need to assign back to the path object
// (0 seconds execution :)
// TOTAL EXECUTION TIME: just 9 seconds (a 129 second saving) ... even better
// than the locally constructed map, which took 12 seconds in the inner loop :)
It seems to me that, given that the vector is a contiguous block of memory (even though it isn't local) and that the lookup table (m_linkNum_to_SlNum) is a member variable, this approach produces code and data that fit in cache without having to go out to main memory, resulting in the significant speed-up. Any other conclusions drawn from these findings would be greatly appreciated.