I have a class containing a number of double values. Instances of it are stored in a vector, where the indices of the elements are important (they are referenced from elsewhere). The class looks something like this:

Vector of classes

class A
{
public:
  double count;
  double val;
  double sumA;
  double sumB;

  vector<double> sumVectorC;
  vector<double> sumVectorD;
};

vector<A> classes(10000);

The code that needs to run as fast as possible is something like this:

vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
  result[i] += classes[i].sumA;
  vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
  if(it != classes[i].sumVectorC.end())
    result[i] += *it;
}

The alternative is, instead of one giant loop, to split the computation into two separate loops, such as:

for(int i = 0; i < classes.size(); i++)
{
  result[i] += classes[i].sumA;
}
for(int i = 0; i < classes.size(); i++)
{
  vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
  if(it != classes[i].sumVectorC.end())
    result[i] += *it;
}

or to store each member of the class in its own vector, like so:

Class of vectors

vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...

and then operate as:

for(int i = 0; i < classes.size(); i++)
{
  result[i] += classCounts[i];
  ...
}

Which way would usually be faster (across x86/x64 platforms and compilers)? And are look-ahead and cache lines the most important things to think about here?

Update

The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is that the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.

+7  A: 

As the implementation of both variants seems easy enough, I would build both versions and profile them to find the faster one.

Empirical data usually beats speculation.
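
For example, a quick-and-dirty comparison might look like this (a minimal sketch; variantA and variantB are hypothetical wrappers around the two candidate loops):

#include <ctime>
#include <iostream>

void variantA();  // hypothetical: the single-loop version from the question
void variantB();  // hypothetical: the split-loop (or class-of-vectors) version

void profileBoth()
{
    const int reps = 10000;  // repeat enough times to get measurable numbers
    std::clock_t t0 = std::clock();
    for (int r = 0; r < reps; ++r) variantA();
    std::clock_t t1 = std::clock();
    for (int r = 0; r < reps; ++r) variantB();
    std::clock_t t2 = std::clock();
    std::cout << "A: " << double(t1 - t0) / CLOCKS_PER_SEC << " s, "
              << "B: " << double(t2 - t1) / CLOCKS_PER_SEC << " s\n";
}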

lothar
A: 

As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.

Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on Xbox 360 and PS3).

Not Sure
+2  A: 

As a side issue: currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testval appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
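
A minimal sketch of that suggestion, using std::set since the question only needs the looked-up value itself back (std::map works the same way when a separate payload is attached; sumSetC is a hypothetical replacement for sumVectorC):

#include <set>

// Balanced-tree lookup: O(log n) node visits instead of the O(n) linear scan.
double lookupContribution(const std::set<double>& sumSetC, double testval)
{
    std::set<double>::const_iterator it = sumSetC.find(testval);
    return (it != sumSetC.end()) ? *it : 0.0;  // caller adds this to result[i]
}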

As a general guideline: consider algorithmic improvements before low-level implementation optimisation.

j_random_hacker
+1  A: 

It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your O(n) linear search into an O(log n) one.
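
A minimal sketch of that variant, assuming sumVectorC is kept sorted (the names are taken from the question):

#include <algorithm>
#include <vector>

// Binary search over the sorted vector: O(log n) comparisons.
double lookupSorted(const std::vector<double>& sumVectorC, double testval)
{
    std::vector<double>::const_iterator it =
        std::lower_bound(sumVectorC.begin(), sumVectorC.end(), testval);
    // lower_bound returns the first element >= testval, so confirm equality.
    if (it != sumVectorC.end() && *it == testval)
        return *it;
    return 0.0;  // caller adds the returned value to result[i]
}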

KeithB
A: 

If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, you can keep the arrays sorted with a dummy infinite entry at the end as a sentinel, and then hand-code the find so that the loop condition is a single test:

 array[i] < test_val

followed by an equality test.

Then you know that the average number of values examined is (size()+1)/2 in the not-found case. Of course, if the search array changes very frequently, keeping it sorted becomes a cost of its own.
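
A minimal sketch of that sentinel search, assuming the vector is sorted ascending, its last element is the infinity sentinel, and test_val is always finite:

#include <cstddef>
#include <vector>

// The sentinel guarantees the loop terminates, so no bounds check is needed:
// every iteration is just the single comparison named above.
double sentinelFind(const std::vector<double>& sorted, double test_val)
{
    std::size_t i = 0;
    while (sorted[i] < test_val)   // single loop-condition test
        ++i;
    return (sorted[i] == test_val) ? sorted[i] : 0.0;  // then the equality test
}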

Of course, you don't tell us much about sumVectorC, or about the rest of A for that matter, so it is hard to give really good advice. For example, if sumVectorC is never updated then it is probably possible to find an EXTREMELY cheap hash (e.g. a cast to unsigned long long plus bit extraction) that is perfect on the sumVectorC values and that fits into a double[8]. Then the overhead is a bit extraction and 1 comparison versus 3 or 6.
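
A purely hypothetical sketch of that idea (the shift amount and table contents are placeholders that would have to be derived offline from the actual sumVectorC values so that no two of them collide):

#include <cstring>

double table[8];            // hypothetical: the 4-5 values, placed at their
                            // hashed slots; fill unused slots with NaN so
                            // they never compare equal to anything
const unsigned shift = 13;  // placeholder: chosen so the real values don't collide

double hashedFind(double testval)
{
    unsigned long long bits;
    std::memcpy(&bits, &testval, sizeof bits);        // reinterpret the double's bits
    double candidate = table[(bits >> shift) & 7u];   // bit extraction: 1 of 8 slots
    return (candidate == testval) ? candidate : 0.0;  // 1 comparison versus 3 or 6
}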

Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems fair), you could consider using an aggregated array, or even just a boost::array<double, 8> with your own dynamic size bolted on, e.g.:

#include <boost/array.hpp>

// Fixed-capacity storage (8 slots cover the 4-5 element bound above)
// with a manually tracked dynamic size.
class AggregatedArray : public boost::array<double, 8>
{
   size_t _size;
public:
   AggregatedArray() : _size(0) {}
   size_t size() const { return _size; }
   void push_back(double d) { (*this)[_size++] = d; }
   void pop() { --_size; }
   void resize(size_t n) { _size = n; }
};

This gets rid of the extra cache-line access to the separately allocated array data of sumVectorC.

In the case of sumVectorC updating only very infrequently, if finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can incur that cost with profit whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant - it is the constants that dominate. It is an engineering problem and not a theoretical one.

Unless you can guarantee that the small maps are in cache, you can almost be guaranteed that using a std::map will yield approximately 170% worse performance, as pretty much every node in the tree will sit in a separate cache line.

Thus, instead of accessing (4*1 + 1*2)/5 = 1.2 cache lines per search (the first 4 values share the first cache line and the 5th sits in the second), you will access (1*1 + 2*2 + 2*3)/5 = 11/5 = 2.2 cache lines for the nodes (1 node at the root, 2 nodes as children of the root, and the last 2 as grandchildren of the root), plus 1 cache line for the map object itself, i.e. 3.2 cache lines per search.

So I would predict using a std::map to take 3.2/1.2 ≈ 267% as long for a sumVectorC with 5 entries.

This is what I meant when I said: "It is an engineering problem and not a theoretical one."