I've implemented the Barnes-Hut gravity algorithm in C as follows:

  1. Build a tree of clustered stars.
  2. For each star, traverse the tree and apply the gravitational forces from each applicable node.
  3. Update the star velocities and positions.

Stage 2 is the most expensive stage, and so is implemented in parallel by dividing the set of stars. E.g. with 1000 stars and 2 threads, I have one thread processing the first 500 stars and the second thread processing the second 500.
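The split described above might look roughly like the following. This is only a sketch under stated assumptions: `Node`, `Job`, `pull` and `compute_forces` are hypothetical names, the tree is a 1-D stand-in, and the interaction is a toy softened force with no Barnes-Hut opening-angle test. The point is the access pattern: every worker reads the same tree and position array but writes only its own slice of the output.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Hypothetical 1-D stand-in for a tree node; not the real structure. */
typedef struct Node {
    double mass, x;                  /* meaningful for leaves only here */
    const struct Node *left, *right;
} Node;

typedef struct {
    const Node *root;    /* shared by all workers, never written in stage 2 */
    const double *pos;   /* shared, read-only */
    double *acc;         /* this worker writes only acc[begin..end) */
    size_t begin, end;
} Job;

/* Toy interaction: sum a softened pull from every leaf under n. */
static double pull(const Node *n, double x)
{
    if (n == NULL)
        return 0.0;
    if (n->left == NULL && n->right == NULL) {
        double d = n->x - x;
        return n->mass * d / (d * d + 1e-3);  /* softening avoids 0/0 */
    }
    return pull(n->left, x) + pull(n->right, x);
}

static void *worker(void *arg)
{
    Job *job = arg;
    for (size_t i = job->begin; i < job->end; i++)
        job->acc[i] = pull(job->root, job->pos[i]);
    return NULL;
}

/* Split stars [0, n) into contiguous slices, one per thread.
 * Returns 0 on success, or the pthread_create error code. */
int compute_forces(const Node *root, const double *pos, double *acc,
                   size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    Job job[MAX_THREADS];
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    size_t nt = (size_t)nthreads, started = 0;
    int rc = 0;
    for (size_t t = 0; t < nt; t++) {
        job[t] = (Job){ root, pos, acc, n * t / nt, n * (t + 1) / nt };
        rc = pthread_create(&tid[t], NULL, worker, &job[t]);
        if (rc != 0)
            break;
        started++;
    }
    for (size_t t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because each star's result is computed by exactly one thread, with the same operations in the same order as a single-threaded run, the output should match the non-threaded version exactly.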

In practice this works: it speeds the computation by about 30% with two threads on a two-core machine, compared to the non-threaded version. Additionally, it yields the same numerical results as the original non-threaded version.

My concern is that the two threads are accessing the same resource (namely, the tree) simultaneously. I have not added any synchronisation to the thread workers, so it's likely they will attempt to read from the same location at some point. Although access to the tree is strictly read-only, I am not 100% sure it's safe. It has worked when I've tested it, but I know this is no guarantee of correctness!

Questions

  • Do I need to make a private copy of the tree for each thread?
  • Even if it is safe, are there performance problems with accessing the same memory from multiple threads?

Update: Benchmark results for the curious:

Machine: Intel Atom CPU N270 @ 1.60GHz, cpu MHz 800, cache size 512 KB

Threads      real      user      sys
      0    69.056    67.324    1.720
      1    76.821    66.268    5.296
      2    50.272    63.608   10.585
      3    55.510    55.907   13.169
      4    49.789    43.291   29.838
      5    54.245    41.423   31.094

0 means no threading at all; 1 and above means the main thread spawns that many worker threads and waits for them. I would not expect much of an improvement for anything beyond 2 threads, since the work is entirely CPU-bound and that's how many cores there are. It's interesting that an odd number of threads is slightly worse than an even number.

Looking at the sys column, it's apparent that creating threads has a cost. Currently the threads are created anew for each frame (so N*1000 thread creations). This was easy to program (during my 15 minutes on the train this morning). I'll need to think a bit about how to reuse threads...

Update #2: I've made it use a pool of threads, synchronised with two barriers. This has no noticeable performance advantage over recreating the threads each frame.
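For readers wondering what a two-barrier pool can look like, here is a minimal sketch using POSIX barriers. It is an assumption about the general shape, not the actual code: `Pool`, `WorkerArg`, `pool_worker` and `run_frames` are hypothetical names, and the per-star work is replaced by a placeholder increment. The main thread takes part in both barriers, so each is initialised with a count of workers plus one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NWORKERS 2

typedef struct {
    pthread_barrier_t start;  /* main releases the workers for a frame */
    pthread_barrier_t done;   /* workers report that the frame is finished */
    double *data;
    size_t n;
    int quit;                 /* set by main before the final release */
} Pool;

typedef struct { Pool *pool; size_t id; } WorkerArg;

static void *pool_worker(void *p)
{
    WorkerArg *w = p;
    Pool *pool = w->pool;
    size_t begin = pool->n * w->id / NWORKERS;
    size_t end = pool->n * (w->id + 1) / NWORKERS;
    for (;;) {
        pthread_barrier_wait(&pool->start);
        if (pool->quit)
            break;
        for (size_t i = begin; i < end; i++)
            pool->data[i] += 1.0;  /* placeholder for the force computation */
        pthread_barrier_wait(&pool->done);
    }
    return NULL;
}

/* Run `frames` frames on a pool created once. Returns 0 on success. */
int run_frames(double *data, size_t n, int frames)
{
    Pool pool = { .data = data, .n = n, .quit = 0 };
    pthread_t tid[NWORKERS];
    WorkerArg arg[NWORKERS];
    assert(data != NULL && frames >= 0);

    if (pthread_barrier_init(&pool.start, NULL, NWORKERS + 1) != 0)
        return -1;
    if (pthread_barrier_init(&pool.done, NULL, NWORKERS + 1) != 0)
        return -1;
    for (size_t t = 0; t < NWORKERS; t++) {
        arg[t] = (WorkerArg){ &pool, t };
        /* Sketch only: real code would recover instead of aborting. */
        if (pthread_create(&tid[t], NULL, pool_worker, &arg[t]) != 0)
            abort();
    }
    for (int f = 0; f < frames; f++) {
        pthread_barrier_wait(&pool.start);  /* begin parallel stage */
        pthread_barrier_wait(&pool.done);   /* wait for all slices */
        /* serial stages (tree rebuild, integration) would run here */
    }
    pool.quit = 1;
    pthread_barrier_wait(&pool.start);      /* release workers to exit */
    for (size_t t = 0; t < NWORKERS; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&pool.start);
    pthread_barrier_destroy(&pool.done);
    return 0;
}
```

The barriers also provide the memory synchronisation, so the write to `quit` before the final `start` barrier is visible to the workers after it.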

+3  A: 

If your data is read-only, then no, you do not need to make a private copy of the tree for each thread. This is the biggest advantage that a shared memory threading model offers!

I'm not aware of any performance problems with such a model. If anything, it may be faster, depending on whether your CPUs can share some of their cache.

Chris AtLee
+2  A: 

You don't specify how your data is structured, but in general reading memory from multiple threads simultaneously is safe and does not introduce any performance issues. You only get problems if someone is writing.

It is interesting that you say you're only getting a 30% speedup out of two threads. If you have an otherwise idle machine, two or more CPUs and only read-only shared data (i.e. no synchronization), I would expect to see a speed improvement much closer to 50%. This suggests that your operation is actually completing so quickly that the overhead of creating the threads is becoming significant in your numbers. Are you running on a hyperthreaded CPU?

Stewart
I suspect it's Amdahl's law -- I've only parallelised one part of the process that originally took about 80-90% of the runtime. The other thing is that, yes, I am creating the threads anew each time. I haven't benchmarked how expensive this is.
Edmund
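To put rough numbers on the Amdahl's law point, here is the arithmetic, taking the 80-90% figure from the comment above as the parallel fraction p (an assumption, since it is only an estimate) and n = 2 threads:

```latex
S(n) = \frac{1}{(1 - p) + p/n}
\qquad
S(2)\big|_{p=0.8} = \frac{1}{0.2 + 0.4} \approx 1.67
\qquad
S(2)\big|_{p=0.9} = \frac{1}{0.1 + 0.45} \approx 1.82
```

That corresponds to a best-case run time reduction of about 40-45%. The benchmark in the question went from 69.056 s to 50.272 s, a speedup of about 1.37 (roughly 27% less run time), so Amdahl's law alone would not seem to account for the whole gap. The remainder is at least consistent with the other overheads raised in this thread.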