views:

282

answers:

3

Background: So I'm working on a raytracer.. for my construction of the spatial partitioning scheme, I initially had some code like this:

if (msize <= 2) { // create a leaf node
    Model **models = new Model*[msize];
    for (uint i=0; i<msize; ++i)
        models[i] = &mlist[i];
    *arrayPtr = Node(models, msize); // class Node contains a copy of models
    ... increment arrayPtr ...
    return;
}

Basically, after this spatial partitioning tree is constructed, rays traverse the tree looking for Models, which are all stored in one big array. Leaf Nodes contain pointers to an array of pointers of Models.

Then I realized that hey, there's no reason for me to add that extra level of indirection; if I arrange my Models properly, I can get the leaf Nodes to point directly to the big array of models. Models adjacent to each other in the big array would all belong to a given leaf node, so the leaves would contain pointers to Models. So I did this, and tested it with everything else held constant.

Now one would think that this would obviously speed up the program. Well, it does speed up the single-threaded version (by about 10%), but it slows down the multithreaded one (by about 15%! Which is quite significant if you're doing heavy optimizing.) I'm quite at a loss on how to tackle this -- I thought indirection was bad, I thought reducing memory usage was good, especially for multithreading.. there's no writing whatsoever to the leaf node or the Model, all writing is done to a separate data structure.

Any pointers / suggestions on how to analyze the problem would be great.

Some miscellaneous statistics: cachegrind tells me that there are less instruction refs / cache misses for the double-indirection approach, but more data refs / cache misses. The difference isn't that large though, for both.

Edit: As requested, the data structure that I am concerned with:

class Node {
    ushort type;
    union {
        ushort axisID;
        ushort childrenSize;
    };
    union {
        Model **models;
        Node *rightChild;
    };
    float leftPlane, rightPlane;
    ... public methods and stuff ...
}

I basically change Model **models to Model *models, and then I get the speed dip. Class Model itself contains a pointer to two abstract classes, Shape and Material. All the classes mentioned here are block-allocated, with the exception of Material since at the moment I'm just using one.

+1  A: 

My first guess is that you are running into false-sharing. If you have multiple threads both modifying memory in the same cache line, the hardware is going to spend a lot of time passing ownership of the cache line between processors.

R Samuel Klatchko
but my Nodes and Models are both block-allocated... hmm.. how would false sharing happen in this case?
int3
As I understand it though, neither thread modifies the data in question, which would rule out false sharing. It was my first thought too though.
jalf
What about read contention for reading the same cache-line during the same bus cycle? I think the L1 cache is single-ported.
Heath Hunnicutt
A: 

The biggest thing I'd look for is some improper initializtion that either makes duplicate data or has improper shared data. It's not evident in the code but it's the obvious mistake to make when going from ** to *.

Charles Eli Cheese
+1  A: 

Another has questioned whether the slow-down comes from the added indirection, or the change in how you allocate the struct Model. Because you are now allocating the Model structs as a contiguous region of memory, it is possible that adjacent structs could share the same cache-line. If your threads are simultaneously accessing adjacent structs, the will contend for access. One read access will stall for a bus cycle whilst awaiting the other.

What is the sizeof(class Model)? You might try expanding it with dummy variables until the class is the sizeof your cache line.

Another possibility is that you have changed the alignment of member variables you are accessing. If your sizeof(class Model) is not a multiple of your machine's word size (e.g., 8-bytes) then an array of such objects will have some members aligned to the word size and some not. Misalignment causes double-fetching on the memory bus, as the fetch unit reads the machine words from aligned memory locations and composites the addressed value out of those two fetches.

Heath Hunnicutt