Hey all,

So I've been playing around with pthreads, specifically trying to calculate the product of two matrices. My code is extremely messy because it was just supposed to be a quick little fun project for myself, but the threading approach I used was very similar to this:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10

int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];

struct v {
   int i; /* row */
   int j; /* column */
};

void *runner(void *param); /* the thread */

int main(int argc, char *argv[]) {

   int i,j, count = 0;
   for(i = 0; i < M; i++) {
      for(j = 0; j < N; j++) {
         //Assign a row and column for each thread
         struct v *data = (struct v *) malloc(sizeof(struct v));
         data->i = i;
         data->j = j;
         /* Now create the thread passing it data as a parameter */
         pthread_t tid;       //Thread ID
         pthread_attr_t attr; //Set of thread attributes
         //Get the default attributes
         pthread_attr_init(&attr);
         //Create the thread
         pthread_create(&tid,&attr,runner,data);
         //Make sure the parent waits for all threads to complete
         pthread_join(tid, NULL);
         count++;
      }
   }

   //Print out the resulting matrix
   for(i = 0; i < M; i++) {
      for(j = 0; j < N; j++) {
         printf("%d ", C[i][j]);
      }
      printf("\n");
   }

   return 0;
}

//The thread will begin control in this function
void *runner(void *param) {
   struct v *data = param; // the structure that holds our data
   int n, sum = 0; //the counter and sum

   //Row multiplied by column
   for(n = 0; n < K; n++){
      sum += A[data->i][n] * B[n][data->j];
   }
   //assign the sum to its coordinate
   C[data->i][data->j] = sum;

   //Free the heap-allocated parameter and exit the thread
   free(data);
   pthread_exit(0);
}

source: http://macboypro.com/blog/2009/06/29/matrix-multiplication-in-c-using-pthreads-on-linux/

For the non-threaded version, I used the same setup (three 2-D matrices, dynamically allocated structs to hold the row/column), and added a timer. The first trials indicated that the non-threaded version was faster. My first thought was that the dimensions were too small to show a difference and that thread creation was eating up the time, so I increased the dimensions to about 50x50, filled the matrices randomly, and ran it again. I'm still not seeing any performance improvement in the threaded version.

What am I missing here?

+10  A: 

Unless you're working with very large matrices (many thousands of rows/columns), you are unlikely to see much improvement from this approach. Setting up a thread on a modern CPU/OS is actually pretty expensive in relative terms: it costs far more CPU time than a few multiply operations.

Also, it's usually not worthwhile to set up more than one thread per CPU core that you have available. If you have, say, only two cores and you set up 2500 threads (for 50x50 matrices), then the OS is going to spend all its time managing and switching between those 2500 threads rather than doing your calculations.

If you were to set up two threads beforehand (still assuming a two-core CPU), keep those threads alive and waiting for work, and feed them the 2500 dot products you need to calculate through some kind of synchronised work queue, then you might start to see an improvement. Even then, though, two cores can at best roughly halve the runtime compared to using only one core.
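For illustration, here's a minimal sketch of that kind of setup: a fixed pool of worker threads pulling dot-product tasks from a shared counter, which stands in for a real work queue (NUM_WORKERS, worker and next_task are made-up names for this example, and the matrix sizes are arbitrary):

#include <pthread.h>
#include <stdio.h>

#define M 50
#define K 50
#define N 50
#define NUM_WORKERS 2   /* roughly one per available core */

int A[M][K], B[K][N], C[M][N];

static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;   /* next cell of C to compute, in 0..M*N-1 */

static void *worker(void *arg) {
   (void)arg;
   for (;;) {
      /* Grab the index of the next cell under the lock */
      pthread_mutex_lock(&task_lock);
      int t = next_task++;
      pthread_mutex_unlock(&task_lock);
      if (t >= M * N)
         break;   /* no work left */

      int i = t / N, j = t % N, n, sum = 0;
      for (n = 0; n < K; n++)
         sum += A[i][n] * B[n][j];
      C[i][j] = sum;
   }
   return NULL;
}

int main(void) {
   pthread_t workers[NUM_WORKERS];
   int w;

   /* ... fill A and B here ... */

   for (w = 0; w < NUM_WORKERS; w++)
      pthread_create(&workers[w], NULL, worker, NULL);
   for (w = 0; w < NUM_WORKERS; w++)
      pthread_join(workers[w], NULL);

   printf("C[0][0] = %d\n", C[0][0]);
   return 0;
}

In practice you'd hand out a bigger chunk per lock acquisition (e.g. a whole row of C), since a single dot product is far too little work to justify even the locking overhead.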

Greg Hewgill
The one caveat to that being the situation where you have a UI thread and a worker thread.
Chris Thompson
@Chris Thompson: Your UI thread is unlikely to be using much CPU power. The advantage to having a separate UI thread is to not *block* your UI thread while doing computation, which keeps your UI responsive.
Greg Hewgill
@Greg, right. That's what I meant :-)
Chris Thompson
+1  A: 

You don't allow much parallel execution: you wait for each thread immediately after creating it, so at any moment only the main thread and one worker exist (and the main thread is just blocked waiting), which means your program can never make use of additional CPUs/cores. Try letting more threads run at once, probably about as many as you have cores.
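For example, a rough sketch of that change to the loop in your code (still one thread per cell, which is far too many threads for a 50x50 matrix, but it shows the create-everything-first, join-afterwards structure):

   pthread_t tids[M * N];
   int i, j, count = 0;

   /* Create all the threads first, so they can actually run concurrently... */
   for(i = 0; i < M; i++) {
      for(j = 0; j < N; j++) {
         struct v *data = malloc(sizeof(struct v));
         data->i = i;
         data->j = j;
         pthread_create(&tids[count], NULL, runner, data);
         count++;
      }
   }

   /* ...and only then wait for all of them to finish */
   for(count = 0; count < M * N; count++)
      pthread_join(tids[count], NULL);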

brittle
+3  A: 

I'm not entirely sure I understand the source code, but here's what it looks like: You have a loop that runs M*N times. Each time through the loop, you create a thread that fills in one number in the result matrix. But right after you launch the thread, you wait for it to complete. I don't think that you're ever actually running more than one thread.

Even if you were running more than one thread, each thread is doing a trivial amount of work. Even if K were large (you mention 50), 50 multiplications isn't much compared to the cost of starting the thread in the first place. The program should create fewer threads--certainly no more than the number of processors--and assign more work to each.
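A sketch of that kind of partitioning (the names and the thread count here are made up; each thread computes a contiguous block of rows of C):

#include <pthread.h>

#define M 50
#define K 50
#define N 50
#define NUM_THREADS 4   /* ideally the number of cores */

int A[M][K], B[K][N], C[M][N];

struct range { int row_start; int row_end; };   /* compute rows [row_start, row_end) */

static void *multiply_rows(void *param) {
   struct range *r = param;
   int i, j, n;
   for (i = r->row_start; i < r->row_end; i++)
      for (j = 0; j < N; j++) {
         int sum = 0;
         for (n = 0; n < K; n++)
            sum += A[i][n] * B[n][j];
         C[i][j] = sum;
      }
   return NULL;
}

int main(void) {
   pthread_t tids[NUM_THREADS];
   struct range ranges[NUM_THREADS];
   int t, rows_per_thread = (M + NUM_THREADS - 1) / NUM_THREADS;   /* ceiling division */

   /* ... fill A and B here ... */

   for (t = 0; t < NUM_THREADS; t++) {
      ranges[t].row_start = t * rows_per_thread;
      ranges[t].row_end = (t + 1) * rows_per_thread;
      if (ranges[t].row_end > M)
         ranges[t].row_end = M;
      pthread_create(&tids[t], NULL, multiply_rows, &ranges[t]);
   }
   for (t = 0; t < NUM_THREADS; t++)
      pthread_join(tids[t], NULL);
   return 0;
}

Each thread writes only its own rows of C, so no locking is needed and there is very little interference between threads.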

Willis Blackburn
+1  A: 

If you have a processor with two cores, then you should just divide the work in two halves and give each thread one of them. The same principle applies if you have 3, 4, or 5 cores. The optimal design will always match the number of threads to the number of available cores (by available I mean cores that aren't already being heavily used by other processes).

One other thing you have to consider is that each thread's data should be contiguous and independent from the data used by the other threads. Otherwise cache misses will significantly slow down the processing.
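As a small illustration of that point (a made-up example, not from the book): if several threads each update their own counter but the counters sit next to each other in memory, they share a cache line and the cores keep invalidating it for one another ("false sharing"). Padding each thread's data out to a full cache line avoids that:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE  64        /* typical cache-line size; check your CPU's */
#define NUM_THREADS 4
#define ITERATIONS  10000000L

/* Give each thread's accumulator its own cache line, so one thread's writes
   don't keep invalidating the line another thread is using */
struct padded_counter {
   long value;
   char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NUM_THREADS];

static void *bump(void *arg) {
   struct padded_counter *c = arg;
   long i;
   for (i = 0; i < ITERATIONS; i++)
      c->value++;
   return NULL;
}

int main(void) {
   pthread_t tids[NUM_THREADS];
   int t;
   for (t = 0; t < NUM_THREADS; t++)
      pthread_create(&tids[t], NULL, bump, &counters[t]);
   for (t = 0; t < NUM_THREADS; t++)
      pthread_join(tids[t], NULL);
   for (t = 0; t < NUM_THREADS; t++)
      printf("counter %d = %ld\n", t, counters[t].value);
   return 0;
}

In the matrix case the analogous rule is to have each thread write its own block of rows of C rather than interleaving writes to neighbouring cells.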

To better understand these issues, I'd recommend the book Patterns for Parallel Programming http://astore.amazon.com/amazon-books-20/detail/0321228111

Although its code samples are geared more toward OpenMP and MPI while you're using pthreads, the first half of the book is rich in fundamental concepts and the inner workings of multithreaded environments, and it's very useful for avoiding most of the performance bottlenecks you'll encounter.

Fabio Ceconello
A: 

Provided the code parallelizes correctly (I haven't checked it), you're likely to see a performance boost only when the code is parallelized in hardware, i.e. when the threads really run in parallel (multiple cores, multiple CPUs, or other technologies), and not just apparently parallel through time-slicing ("multitasking"). Just an idea, I am not sure this is the case.

ShinTakezou