views:

68

answers:

3

Here is what I'm doing:

I'm taking in bezier points and running bezier interpolation then storing the result in a std::vector<std::vector<POINT>.

The bezier calculation was slowing me down so this is what I did.

I start with a std::vector<USERPOINT> which is a struct with a point and 2 other points for bezier handles.

I divide these up into ~4 groups and assign each thread to do 1/4 of the work. To do this I created 4 std::vector<std::vector<POINT> > to store the results from each thread.In the end all the points have to be in 1 continuous vector, before I used multithreading I accessed this directly but now I reserve the size of the 4 vectors produced by the threads and insert them into the original vector, in the correct order. This works, but unfortunatly the copy part is very slow and makes it slower than without multithreading. So now my new bottleneck is copying the results to the vector. How could I do this way more efficiently?

Thanks

A: 

Multithreading is not going to speed up your process. Processing the data in different cores, could.

devoured elysium
Isn't that what multithreading does :-p
Milo
@devoured: I think the OP isn't stupid. Obviously the point of using multiple threads is to use multiple cores.
Billy ONeal
A: 

If the simple copy in between things is slower than before you started using Mulitthreading, it's entirely likely that what you're doing simple isn't going to scale to multiple cores. If it's something simple like bezier stuff I suspect that's going to be the case.

Remember that the overhead of making the threads and such has an impact on total run time.

Finally.. for the copy, what are you using? Is it std::copy?

Billy ONeal
std::insert 4 times
Milo
+3  A: 

Have all the threads put their results into a single contiguous vector just like before. You have to ensure each thread only accesses parts of the vector that are separate from the others. As long as that's the case (which it should be regardless -- you don't want to generate the same output twice) each is still working with memory that's separate from the others, and you don't need any locking (etc.) for things to work. You do, however, need/want to ensure that the vector for the result has the correct size for all the results first -- multiple threads trying (for example) to call resize() or push_back() on the vector will wreak havoc in a hurry (not to mention causing copying, which you clearly want to avoid here).

Edit: As Billy O'Neal pointed out, the usual way to do this would be to pass a pointer to each part of the vector where each thread will deposit its output. For the sake of argument, let's assume we're using the std::vector<std::vector<POINT> > mentioned as the original version of things. For the moment, I'm going to skip over the details of creating the threads (especially since it varies across systems). For simplicity, I'm also assuming that the number of curves to be generated is an exact multiple of the number of threads -- in reality, the curves won't divide up exactly evenly, so you'll have to "fudge" the count for one thread, but that's really unrelated to the question at hand.

std::vector<USERPOINT> inputs; // input data   
std::vector<std::vector<POINT> > outputs; // space for output data

const int thread_count = 4;

struct work_packet {           // describe the work for one thread
    USERPOINT *inputs;         // where to get its input
    std::vector<POINT> *outputs;   // where to put its output
    int num_points;                // how many points to process
    HANDLE finished;               // signal when it's done.
};

std::vector<work_packet> packets(thread_count); // storage for the packets.
std::vector<HANDLE> events(thread_count);       // storage for parent's handle to events

outputs.resize(inputs.size);                    // can't resize output after processing starts.

for (int i=0; i<thread_count; i++) {
    int offset = i * inputs.size() / thread_count;
    packets[i].inputs = &inputs[0]+offset;
    packets[i].outputs = &outputs[0]+offset;
    packets[i].count = inputs.size()/thread_count;
    events[i] = packets[i].done = CreateEvent();

    threads[i].process(&packets[i]);
}


// wait for curves to be generated (Win32 style, for the moment).
WaitForMultipleObjects(&events[0], thread_count, WAIT_ALL, INFINITE);

Note that although we have to be sure that the outputs vector doesn't get resized while be operated on by multiple threads, the individual vectors of points in outputs can be, because each will only ever be touched by one thread at a time.

Jerry Coffin
+1 but Note: You might have to use pointers to the elements in the vector instead of passing the vector itself here because STL containers can be strange with regards to multithreading behavior.
Billy ONeal
@Billy: Excellent point.
Jerry Coffin
I think @Billy ONeal's concern is important enough that you need to make it part of your answer before I give you a point. A little sample code would also be helpful.
Omnifarious