views: 1204

answers: 19

Hello everyone,

I am new to programming in general so please keep that in mind when you answer my question.

I have a program that takes a large 3D array (1 billion elements) and sums up elements along the various axes to produce a 2D array of a projection of each side of the data. The problem here is that it is very RAM intensive, as the program is constantly fetching information from RAM, both reading and writing.

The question is, will I gain any performance increase if I multithread the program, or will I end up running into a RAM access bottleneck? When I say multithreading, I only mean multithreading for 2 or 4 cores, no more.

If it helps, my current computer configuration is a 2.4 GHz Core 2 Quad, 1033 FSB, 4 GB RAM at 667 MHz.

Thanks in advance,

-Faken

Edit:

It seems to me that people here are much more interested in this question than I had first expected. I'll expand the question and post some code for those who are interested.

First of all, a little background on me so you understand where I'm coming from. I am a mechanical engineering graduate student who somehow managed to pick a topic that pretty much had nothing to do with mechanical engineering. I took one course in introductory Java (forced) approximately 5 years ago and had never touched programming until about a month ago, when I started my thesis in earnest. I have also taken (again forced, still don't know why) a course in electronics and computer engineering, where we dealt with microcontrollers (8-bit), their inner workings, and some ASM coding for them. Other than that, I know next to nothing about programming.

Here is the code:

int dim = 1000;
int steps = 7; // ranges from 1 to 255

for (int stage = 1; stage < steps; stage++)
    for (int j = 0; j < dim; j++)
        for (int i = 0; i < dim; i++)
        {
            int sum = 0;
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;

            projection[(j * dim) + i] = sum;
        }

This section of code operates on the z-axis only. The main data, due to the way it was constructed, has a strange addressing system, but you don't need to worry about that. There is also other code for doing the projections of the other sides of the cube, but that code does very different things.

+4  A: 

It's impossible to tell, in general, because you did not specify how fast your CPU and RAM are. Chances are good that it will improve things, because I can't imagine how even 4 threads summing in parallel would saturate RAM enough that it would become a bottleneck (and not the CPU).

Pavel Minaev
I did state the CPU and RAM speed...
Faken
Even so, experimentation is probably the only way. You've got a multi-core machine, so I'd guess you can improve the speed. It depends on how intensive the calculation is compared to the cost of getting data from RAM to CPU cache and back.
Scott Langham
+3  A: 

My gut says you'll see modest improvements. However, predicting the results of optimizations is a notoriously error prone affair.

Try it and benchmark the results.
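A minimal timing harness is only a few lines. This is just a sketch, assuming a C++11 compiler; computeProjections() is a placeholder for whichever variant you're measuring:

#include <chrono>
#include <iostream>

// placeholder for whichever variant (single- or multi-threaded) is under test
void computeProjections() { /* ... the summing code goes here ... */ }

int main() {
    auto start = std::chrono::steady_clock::now();
    computeProjections();
    auto stop = std::chrono::steady_clock::now();
    std::cout << "elapsed: "
              << std::chrono::duration<double>(stop - start).count()
              << " s\n";
}

Run each variant a few times and keep the best number, so cache warm-up and paging don't skew the comparison.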

Kevin Montrose
Heh, I would if I knew what I was doing :) The reason I ask is to see if it's worth my time to learn how to multithread to begin with. If most people say I will see no real improvement, then I shouldn't waste my time on it; after all, I am a beginner programmer, and new concepts come slowly if you don't have the background.
Faken
Multithreading is a pretty important thing to "get", and there's no time like the present to learn it. :)
Kevin Montrose
+2  A: 

Multithreading will only make your code faster if the computations can be broken down into chunks that can be worked on independently and concurrently.


EDIT

I said the above (it's almost an automatic response) because I see many developers spend a lot of time on multithreading code for no performance increase at all. Of course, they then end up with the same (or even slower) performance and the extra complications of managing multiple threads.

Yes, after reading your question again and taking into account your specific case, it does appear you would benefit from multithreading.

RAM is very fast, so I think it would be very hard to saturate the memory bandwidth unless you have many, many threads.

Dana Holt
I agree: certain tasks are appropriate for multithreading, certain aren't
CrazyJugglerDrummer
and this problem is
kenny
My application is definitely multithreadable; actually, I guess it would be considered "embarrassingly parallel", as each operation can be done independently of the others. Furthermore, reads and writes can be done at the same time without interfering with each other, because each "operation" of my code is operating on a separate set of data and writing to something that nothing else would touch. The question is not whether it is multithreadable, but rather whether I will hit a RAM access bottleneck if I do so.
Faken
The threads are not independent, so they may interfere with each other due to sharing the data structure. I assume the data is in a shared heap or other thread-global region, and not that each thread has a copy of the data it needs (like a row or column of the data), which would be unwise for this isolated use of the data. Just saying that multithreading may not necessarily be the way to approach the problem.
jeffD
A: 

If you can divide the array in such a way that the threads don't read from or write to the same positions in the array, it should increase your speed.

Janusz
+2  A: 

If, and this is a big IF, it is coded appropriately, you will most definitely see a speedup. Now, as one of my professors always noted, people often try to take an algorithm and thread it, and in the end it is slower. This is often because of inefficient synchronization. So basically, if you feel like delving into threading (I honestly wouldn't suggest it if you are new to programming), have a go.

In your particular case the synchronization could be quite straightforward. That is to say, you could assign each thread to a quadrant of the large 3D matrix, where each thread is guaranteed to have sole access to a specific area of the input and output matrices; thus there is no real need to 'protect' the data from multiple accesses/writes.

In summary, in this specific simple case threading may be quite easy, but in general synchronization when done poorly can cause the program to take longer. It really all depends.
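As a rough sketch of that kind of partitioning (assuming C++11 std::thread, an unsigned char data array and an int projection array as in the posted code; names are illustrative), each worker gets its own range of j rows and therefore its own disjoint slice of the output, so no locking is needed:

#include <thread>
#include <vector>

// Sum along k for rows [jBegin, jEnd) of the projection. Each worker touches
// a disjoint slice of 'projection', so no synchronization is required.
void projectRows(const unsigned char* partMap, int* projection,
                 int dim, int stage, int jBegin, int jEnd) {
    for (int j = jBegin; j < jEnd; j++)
        for (int i = 0; i < dim; i++) {
            int sum = 0;
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;
            projection[(j * dim) + i] = sum;
        }
}

void projectParallel(const unsigned char* partMap, int* projection,
                     int dim, int stage, int numThreads) {
    std::vector<std::thread> workers;
    int rowsPerThread = (dim + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; t++) {
        int jBegin = t * rowsPerThread;
        int jEnd = (jBegin + rowsPerThread < dim) ? jBegin + rowsPerThread : dim;
        workers.push_back(std::thread(projectRows, partMap, projection,
                                      dim, stage, jBegin, jEnd));
    }
    for (size_t t = 0; t < workers.size(); ++t)
        workers[t].join();
}

With 2 or 4 cores you would pass numThreads = 2 or 4; each thread writes only its own rows of projection, which is exactly the "sole access" situation described above.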

DeusAduro
+1  A: 

If you're partitioning your data correctly, then yes, you will see a boost in performance. If you check your CPU usage right now, one core will be at 100% and the other 3 should be close to 0%.

It all depends on how well you structure your threads and memory usage.

Also, do not expect a 4x improvement. 4x is the maximum achievable; it will always be lower than that, depending on a lot of factors.

Eric
Yea, I think I get it. Yes, 1 core is at 100% load while the rest just sit there. I guess that means my RAM bandwidth is not being used fully; otherwise my one core on the CPU would be at less than 100% while it was waiting for data from the RAM. So basically my performance increase will depend on how much RAM access overhead I have left.
Faken
@Faken - Not so. 100% CPU consumption means that the idle loop isn't running at all for the measured interval. The OS can't schedule around stalls due to RAM, so any delays due to memory are not measurable. I believe VTune can give you info on latencies due to RAM.
Michael
A: 

Absolutely. At the least, getting a thread on each core to work on your problem concurrently will help. It's not clear if more threads would help, but it's possible.

kenny
+1  A: 

Your computer system typically has several elements that can limit its overall performance. Which part is the limiting element depends on the concrete situation. Normally, one of the following factors may be the cause of your performance problems.

  • Disk I/O bandwidth: In most enterprise applications the sheer size of the data processed requires it to be stored in some database. Accessing this data may be slowed down by the maximum transfer speed, but very often the biggest impact is caused by a large number of small disk accesses reading some blocks here and there. Then the latency of the disk heads moving around, and even the time the disk requires for a full rotation, may limit your application. A long time ago I had a real problem with an expensive Sun E430 installation that was outperformed by my small NeXTstation... It was the constant fsync()ing of my database, which was slowed down by disks not caching write accesses (for good reason). Normally you can speed up your system by adding additional disks to get more I/O per second. Dedicating your drives to specific tasks may do even better in some cases.

  • Network latency: nearly everything said about disks affecting application speed applies equivalently to network I/O.

  • RAM: If your RAM is not big enough to hold your complete application image, parts of it have to be stored on external disk. Therefore the disk I/O slowdown bites you again.

  • CPU processing speed (either integer or floating point): CPU processing power is the next factor that limits CPU-intensive tasks. A CPU has a physical speed limit that cannot be exceeded. The only way to speed up is to add more CPUs.

These limits may help you to find an answer for your specific problem.

Do you simply need more processing power, and does your system have more than one CPU or core? In that case multithreading will improve your performance.

Do you observe significant network or disk latency? If you do, your valuable CPU might be throwing away CPU cycles waiting for some slow I/O. If more than one thread is active, another thread might find all the data it requires for processing in memory and could pick up these otherwise wasted CPU cycles.

Therefore you need to observe your existing application. Try to estimate the memory bandwidth of the data being shuffled around. If the application is active on one CPU at below 100%, you might have reached the memory bandwidth limit. In that case, additional threading will do you no good, because it does not give you more bandwidth from memory.

If the CPU is at 100%, give it a try, but have a look at the algorithms. Multi-threading will add additional overhead for synchronization (and complexity, tons of complexity) that might slightly reduce the memory bandwidth. Prefer algorithms that can be implemented while avoiding fine-grained synchronization.

If you see I/O wait times, think about clever partitioning or caching and then about threading. There is a reason why GNU-make supported parallel build back in the 90's :-)

The problem domain you've described leads me to look at clever algorithms first. Try to use sequential read/write operations on main memory as much as possible to support the CPU and memory subsystems. Keep operations "local" and data structures as small and optimized as possible to reduce the amount of memory that needs to be shuffled around before switching to a second core.

Ralf Edmund
+28  A: 

Multithreading across multiple cores could reduce the time required to sum across the axes, but special care is required. You might actually get larger performance boosts from some changes you could make to your single-threaded code:

  1. You only need as many threads as there are cores available to you. This is a CPU-intensive operation, and threads are unlikely to be waiting for I/O.

  2. The above assumption might not hold if the entire array does not fit in RAM. If portions of the array are paged in and out, some threads will be waiting for paging operations to complete. In that case, the program might benefit from having more threads than cores. Too many, however, and performance will drop due to the cost of context switching. You might have to experiment with the thread count. The general rule is to minimize the number of context switches between ready threads.

  3. If the entire array does not fit in RAM, you want to minimize paging! The order in which each thread accesses memory matters, as does the memory access pattern of all the running threads. To the extent possible, you would want to finish up with one part of the array before moving to the next, never to return to a covered area.

  4. Each core would benefit from having to access a completely separate region of memory. You want to avoid memory access delays caused by locks and bus contention. At least for one dimension of the cube, that should be straightforward: set each thread with its own portion of the cube.

  5. Each core would also benefit from accessing more data from its cache(s), as opposed to fetching from RAM. That would mean ordering the loops such that inner loops access nearby words, rather than skipping across rows.

  6. Finally, depending on the data types in the array, the SIMD instructions of Intel/AMD processors (SSE, at their various generations) can help accelerate single-core performance by summing multiple cells at once. VC++ has some built-in support; see the sketch after this list.

  7. If you have to prioritize your work, you might want to first minimize disk paging, then concentrate on optimizing memory access to make use of the CPU caches, and only then deal with multithreading.
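To illustrate point 6, here is a sketch of counting bytes >= a threshold with SSE2 intrinsics. It assumes the values being summed are contiguous in memory (i.e. the loop order has already been rearranged) and that the run length is a multiple of 16; it is not a drop-in replacement for the posted loop:

#include <emmintrin.h>  // SSE2

// Count how many of the n bytes in p are >= threshold.
// Assumes n is a multiple of 16 and small enough (n <= 16*255) that the
// per-lane byte counters cannot overflow before the final reduction.
int countAtLeast(const unsigned char* p, int n, unsigned char threshold) {
    __m128i thresh = _mm_set1_epi8((char)threshold);
    __m128i counts = _mm_setzero_si128();
    for (int k = 0; k < n; k += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + k));
        // v >= threshold  <=>  max(v, threshold) == v  (unsigned byte compare)
        __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(v, thresh), v);
        // lanes of 'ge' are 0xFF (i.e. -1) where true; subtracting adds 1 per hit
        counts = _mm_sub_epi8(counts, ge);
    }
    // horizontal sum of the 16 byte counters
    __m128i sums = _mm_sad_epu8(counts, _mm_setzero_si128());
    return _mm_cvtsi128_si32(sums) + _mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
}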

Oren Trutner
This is it! Thank you very much, this is EXACTLY what I have been looking for!
Faken
In terms of spatial locality, I'd also look at http://en.wikipedia.org/wiki/Hilbert_curve - this is an algorithm for moving across a space while maximising spatial locality - it should help your cache usage and speed up your accesses.
Dave Rigby
Sorry Dave, what you're saying makes little sense to me. The 3D array in this case is actually a giant 1-billion-element 1D array allocated on the heap... which is linear. In terms of spatial locality, that would only be valid along the 1D path, which would then only be valid for my projections along one axis (I could re-shuffle the data so that it would apply to other axes, but the computational time and headache are not worth it).
Faken
@Faken: Ah yes, sorry I'd misunderstood your data structure. Having said that, you will be thrashing the CPU cache, as you'll be accessing elements of the array which are adjacent in 3D space (i.e. one column) which will be very spread out in the 1D array. onebyone's answer below describes this well.
Dave Rigby
"You want to avoid memory access delays caused by locks and bus contention." One way to avoid write contention in the other dimensions is to "shard" the totals. This means each thread writes to its own array of totals, and you add them all up single-threaded at the end. With only four cores the duplication is a significant but not massive memory overhead, and the code is almost certainly simpler than ensuring that simultaneous parcels of work are "diagonal" (i.e. the projections onto the faces of the cube are non-intersecting).
Steve Jessop
Prefetching a proper distance ahead of where you need to use the data may also give a big boost. It will be fairly easy since you are stepping through the array in a very regular way.
Greg Rogers
+5  A: 

How does your code work? Does it go like this?

for each row: add up the values
for each column: add up the values
for each stack: add up the values

If so, you might want to read up on "locality of reference". Depending how your data is stored, you might find that while you're doing the stacks, a whole cache line has to be pulled in for each value, because the values are nowhere near each other in memory. In fact, with a billion values, you could be pulling things all the way from disk. Sequential access with a long stride (distance between values) is the worst possible use for cache. Try profiling, and if you see that adding up the stacks is taking longer than adding up the rows, this is almost certainly why.

I think you could be saturating the memory bus(*), in which case multithreading would only help if core2 quad uses different buses for different cores. But if you're not saturating the bus bandwidth, you can't get best performance this way even once you multi-thread. You'll have 4 cores spending all their time stalled on cache misses instead of one.

If you are memory cache bound, then your goal should be to visit each page/line of memory as few times as possible. So I'd try things like running over the data once, adding each value to three different totals as you go. If that runs faster on a single core, then we're in business. The next step is that with a 1000x1000x1000 cube, you have 3 million totals on the go. That doesn't fit in cache either, so you have to worry about the same cache miss problems writing as you do reading.

You want to make sure that as you run along a row of 1000 adjacent values in RAM adding to the row total that they all share, you're also adding to adjacent totals for the columns and stacks (which they don't share). So the "square" of column totals should be stored in the appropriate way, as should the "square" of stacks. That way you deal with 1000 of your billion values just by pulling about 12k of memory into cache (4k for 1000 values, plus 4k for 1000 column totals, plus 4k for 1000 stack totals). As against that, you're doing more stores than you would be by concentrating on 1 total at a time (which therefore could be in a register).
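To make that concrete, here is a sketch of the single-pass idea. It is illustrative only: it assumes a layout where k is the fastest-varying index, and it sums raw values rather than applying the >= stage test from the posted code:

#include <vector>

// One pass over the cube, accumulating all three projections at once.
// 'data' holds dim*dim*dim values with k varying fastest (an assumption
// for this sketch; adjust the indexing to the real layout).
void projectAll(const unsigned char* data, int dim,
                std::vector<long>& rowTotals,    // summed over k -> one total per (i,j)
                std::vector<long>& colTotals,    // summed over j -> one total per (i,k)
                std::vector<long>& stackTotals)  // summed over i -> one total per (j,k)
{
    rowTotals.assign(dim * dim, 0);
    colTotals.assign(dim * dim, 0);
    stackTotals.assign(dim * dim, 0);

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            for (int k = 0; k < dim; k++) {
                long v = data[((i * dim) + j) * dim + k];
                rowTotals[(i * dim) + j]   += v;  // same cell for the whole k run
                colTotals[(i * dim) + k]   += v;  // adjacent cells as k runs
                stackTotals[(j * dim) + k] += v;  // adjacent cells as k runs
            }
}

The input is read strictly sequentially, and the totals being written are either constant or adjacent within each inner run, which keeps the per-row working set small, as described above.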

So I don't promise anything, but I think it's worth looking at order of memory access, whether you multi-thread or not. If you can do more CPU work while accessing only a relatively small amount of memory, then you'll speed up the single-threaded version but also put yourself in much better shape for multi-threading, since the cores share a limited cache, memory bus, and main RAM.

(*) Back-of-envelope calculation: in random reviews off the internet, the highest estimated FSB bandwidth for Core2 processors I've found so far is an Extreme at 12 GB/s (with 2 channels at 4x199 MHz each). Cache line size is 64 bytes, which is less than your stride. So summing a column or stack the bad way, grabbing 64 bytes per value, would only saturate the bus if it was doing 200 million values per second. I'm guessing it's nothing like that fast (10-15 seconds for the whole thing), or you wouldn't be asking how to speed it up.

So my first guess was probably way off. Unless your compiler or CPU has inserted some very clever pre-fetching, a single core cannot be using 2 channels and 4 simultaneous transfers per cycle. For that matter, 4 cores couldn't use 2 channels and 4 simultaneous transfers. The effective bus bandwidth for a series of requests might be much lower than the physical limit, in which case you would hope to see good improvements from multi-threading simply because you have 4 cores asking for 4 different cache lines, all of which can be loaded simultaneously without troubling the FSB or the cache controller. But the latency is still the killer, and so if you can load less than one cache line per value summed, you'll do much better.

Steve Jessop
I only have a 1033 MHz FSB; it's the first generation of Core 2 Quads, and the computer is over 2 years old already. You guys seem much more into this question than I had first expected... I guess I'll post the actual code; you guys seem rather interested.
Faken
+2  A: 

I think that even if multithreading can produce a performance boost it's the wrong way to approach optimization. Multiple cores are all the rage because they are the only way for CPU manufacturers to provide faster CPU speeds at a marketable rate -- not necessarily because they're an amazing programming tool (there's still a lot of maturing that needs to happen).

Always look at the algorithm you're using above all else. You say your program is very RAM intensive -- what can you do to improve cache hits? Is there a way to sort your array so that the computations can be applied linearly? What programming language are you using and would it benefit you to optimize in a lower level language? Is there a way that you can use dynamic programming to store your results?

In general, spend all your resources working toward a more efficient algorithm, mathematically and as compiler optimizations, then worry about multi-core. Of course, you may already be at that stage, in which case this comment isn't very useful ;p

Kai
+2  A: 

The questions you need to answer for your particular application are well-known.

First, is the work parallelisable? Amdahl's Law will give you an upper bound on how much you can speed things up with multithreading.
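For example, if 95% of the run can be parallelised perfectly, the best 4 cores can give is 1 / (0.05 + 0.95/4) ≈ 3.5x rather than 4x; the serial fraction sets the ceiling.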

Second, would a multithreaded solution introduce a lot of overhead? You say the program is "RAM intensive as the program is constantly fetching information from the RAM, both reading and writing." So you need to determine if the reading/writing is going to cause significant coordination overhead. This isn't easy. Although each CPU can access the computer's entire RAM (both read and write) at any time, doing so can slow down memory accesses -- even without locks -- because the various CPUs keep their own caches and need to coordinate what's in their caches with each other (CPU 1 has a value in cache, CPU 2 updates that value in RAM, CPU 2 has to tell CPU 1 to invalidate its cache). And if you do need locks (which is almost a guarantee as you're both "reading and writing" memory) then you'll need to avoid contention as much as possible.

Third, are you memory bound? "RAM intensive." is not the same thing as "memory bound." If you are currently CPU bound then multithreading will speed things up. If you are currently memory bound then multithreading may even slow things down (if one thread is too fast for memory, then what will happen with multiple threads?).

Fourth, are you slow for some other reason? If you're newing or mallocing a lot of memory in your algorithm you may be seeing overheads from that alone. And on many platforms both new and malloc don't handle multithreading well, so if you're slow right now because malloc is bad, a multithreaded program will be even slower because malloc will be worse.

Overall, however, without seeing your code, I would expect it to be CPU bound and I would expect multithreading to speed things up -- almost as much as Amdahl's law would suggest, in fact. You may want to look at OpenMP or Intel's Threading Building Blocks library, or some sort of thread queue to do it, though.
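For instance, with OpenMP the outer loop of the posted code can be spread across the cores with a single pragma. This is only a sketch (it assumes the per-cell work is independent, as described in the question, an unsigned char input and int output, and a compiler switch such as -fopenmp):

// Each (j, i) cell of the projection is computed independently, so the rows
// can be handed out to the available cores with no locking.
void projectZ(const unsigned char* partMap, int* projection, int dim, int stage) {
    #pragma omp parallel for
    for (int j = 0; j < dim; j++)
        for (int i = 0; i < dim; i++) {
            int sum = 0;
            for (int k = 0; k < dim; k++)
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    sum++;
            projection[(j * dim) + i] = sum;
        }
}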

Max Lybbert
+1 for Amdahl's Law.
Paul Nathan
+2  A: 

Before you go multithreaded, you should run a profiler against your code. It's probably a different question as to where a good (possibly) free C++ profiler can be found.

This will help you identify any bits of your code that are taking up significant portions of computation time. A tweak here and there after some profiling can sometimes make massive differences to performance.

Scott Langham
+2  A: 

Though this would probably be very challenging for you if you're new to programming, a very powerful way to speed things up would be to use the power of the GPU. Not only is VRAM much faster than usual RAM, but the GPU can also run your code in parallel on some 128 or more cores. Of course, for this amount of data you will need a pretty large amount of VRAM.

If you decide to check this possibility out, you should look up nVidia CUDA. I haven't checked it out myself, but it's meant for problems like this.

Vilx-
I may check it out. I know deeper into my project there may be a use or even necessity for it.
Faken
A: 

I guess if you are just dealing with bits you might not have to page or use a swap file, and in that case, YES, multithreading will help.

If you can't load everything into memory at once, you need to be more specific about your solution--it needs to be tailored to threading.

For example: suppose you load your array in smaller blocks (the size might not matter much). If you were to load in a 1000x1000x1000 cube, you could sum on that. The results could be stored temporarily in their own three planes, then added to your 3 "final result" planes, and then the 1000^3 block could be thrown away, never to be read again.

If you do something like this, you won't run out of memory, you won't stress the swapfile and you won't have to worry about any thread synchronization except in a few very small, specific areas (if at all).

The only problem then is to ensure your data is in such a format that you can access a single 1000^3 cube directly--without seeking the hard disk head all over the place.

Edit: The comment was correct and I'm wrong--he totally makes sense.

Since yesterday I realized that the entire problem could be solved as it was read in -- each piece of data read in could immediately be summed into the results and discarded. When I think about it that way, you're right: it's not going to be much help unless the threading can read two streams at the same time without colliding.

Bill K
I don't do a ton of multi-threaded programming, but I've done a bit and this seems to me to be correct. Somebody spammed like 5 downvotes on reasonable answers in this thread without stating "Why" on a single one. I'm willing to learn if my answer has a huge flaw (Data I/O is the most likely I can think of, but no storage system is specified in the question!). Anyway, could someone please educate a little? It's the difference between being helpful and being a dick. Thanks.
Bill K
With a simple task like addition, the program is often not ALU-limited ("CPU"-limited) at all, but rather memory-bus limited. This is very important for this question. The best answers to this question reflect this; those I have downvoted do not.
Suma
+1  A: 

Eliminate False Sharing

This is where multiple cores are blocking on each other trying to read or update different memory addresses that share the same cache block. Processor cache locking is per block, and only one thread can write to that block at once.

Herb Sutter has a very good article on False Sharing, how to discover it and how to avoid it in your parallel algorithms.

Obviously he has loads of other excellent articles on concurrent programming too; see his blog.
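One common mitigation, sketched here assuming C++11 alignas and 64-byte cache lines, is to give each thread's accumulator its own cache line so two cores can never fight over the same one:

// One counter per worker thread, each forced onto its own cache line.
struct alignas(64) PaddedCount {
    long value;
};

PaddedCount partial[4];  // e.g. one per core; add the four together at the end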

iain
The way this would be multithreaded, there would be no locks used, since each thread could not possibly read or write something that another thread has access to.
Faken
Sorry for the late reply. I know you might not use locks in your code; however, the processor's cache has a lock that prevents multiple cores from writing to the same area of the cache at the same time. The trouble is that you have no control over these locks or the size of their area. So if your data is located close together, your threads can end up competing for these cache locks, with the extra threads causing worse performance. One technique to mitigate this is to use the stack and then copy results to the heap at the end.
iain
+3  A: 

There is only one way to optimize code: figure out what you're doing that's slow, and do less of it. A special case of "doing less of it" is to do something else instead that's faster.

So first of all, here's what I'm doing based on your posted code:

#include <fstream>
#include <sstream>
using std::ios_base;

template<typename Iterator, typename Value>
void iota(Iterator start, Iterator end, Value val) {
    while (start != end) {
        *(start++) = val++;
    }
}

int main() {

    const int dim = 1000;
    const int cubesize = dim*dim*dim;
    const int squaresize = dim*dim;
    const int steps = 7; //ranges from 1 to  255
    typedef unsigned char uchar;

    uchar *partMap = new uchar[cubesize];
    // dummy data. I timed this separately and it takes about
    // a second, so I won't worry about its effect on overall timings.
    iota(partMap, partMap + cubesize, uchar(7));
    uchar *projection = new uchar[squaresize];

    for (int stage = 1; stage < steps; stage++) {
        for (int j = 0; j < dim; j++) {
                for (int i = 0; i < dim; i++)
                {
                        int sum = 0;
                        for (int k = 0; k < dim; k++)
                            if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                                sum++;

                        projection[(j*dim) + i] = sum;
                }
        }

        std::stringstream filename;
        filename << "results" << stage << ".bin";
        std::ofstream file(filename.str().c_str(), 
            ios_base::out | ios_base::binary | ios_base::trunc);
        file.write((char *)projection, squaresize);
    }

    delete[] projection;
    delete[] partMap;
}

(Edit: just noticed that "projection" should be an array of int, not uchar. My bad. This will make a difference to some of the timings, but hopefully not too big of one.)

Then I copied result*.bin to gold*.bin, so I can check my future changes as follows:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5
6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    1m41.978s
user    1m39.450s
sys     0m0.451s

OK, so 100 seconds at the moment.

So, speculating that it's striding through the billion-item data array that's slow, let's try only going through once, instead of once per stage:

    uchar *projections[steps];
    for (int stage = 1; stage < steps; stage++) {
         projections[stage] = new uchar[squaresize];
    }

    for (int j = 0; j < dim; j++) {
            for (int i = 0; i < dim; i++)
            {
                    int counts[256] = {0};
                    for (int k = 0; k < dim; k++)
                            counts[partMap[(((i * dim) + k) * dim) + j]]++;

                    int sum = 0;
                    for (int idx = 255; idx >= steps; --idx) {
                        sum += counts[idx];
                    }
                    for (int stage = steps-1; stage > 0; --stage) {
                        sum += counts[stage];
                        projections[stage][(j*dim) + i] = sum;
                    }
            }
    }

    for (int stage = 1; stage < steps; stage++) {
        std::stringstream filename;
        filename << "results" << stage << ".bin";
        std::ofstream file(filename.str().c_str(),
            ios_base::out | ios_base::binary | ios_base::trunc);
        file.write((char *)projections[stage], squaresize);
    }

    for (int stage = 1; stage < steps; stage++) delete[] projections[stage];
    delete[] partMap;

It's a bit faster:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5
6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    1m15.176s
user    1m13.772s
sys     0m0.841s

Now, steps is quite small in this example, so we're doing a lot of unnecessary work with the "counts" array. Without even profiling, I'm guessing that counting to 256 twice (once to clear the array and once to sum it) is quite significant compared with counting to 1000 (to run along our column). So let's change that:

    for (int j = 0; j < dim; j++) {
            for (int i = 0; i < dim; i++)
            {
                    // steps+1, not steps. I got this wrong the first time,
                    // which at least proved that my diffs work as a check
                    // of the answer...
                    int counts[steps+1] = {0};
                    for (int k = 0; k < dim; k++) {
                        uchar val = partMap[(((i * dim) + k) * dim) + j];
                        if (val >= steps) 
                            counts[steps]++;
                        else counts[val]++;
                    }

                    int sum = counts[steps];
                    for (int stage = steps-1; stage > 0; --stage) {
                        sum += counts[stage];
                        projections[stage][(j*dim) + i] = sum;
                    }
            }
    }

Now we're only using as many buckets as we actually need.

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5
6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    0m27.643s
user    0m26.551s
sys     0m0.483s

Hurrah. The code is nearly 4 times as fast as the first version, and produces the same results. All I've done is change what order the maths is done: we haven't even looked at multi-threading or prefetching yet. And I haven't attempted any highly technical loop optimisation, just left it to the compiler. So this can be considered a decent start.

However it's still taking an order of magnitude longer than the 1s which iota runs in. So there are probably big gains still to find. One main difference is that iota runs over the 1d array in sequential order, instead of leaping about all over the place. As I said in my first answer, you should aim to always use sequential order on the cube.

So, let's make a one-line change, switching the i and j loops:

    for (int i = 0; i < dim; i++)
            for (int j = 0; j < dim; j++) {

This still isn't sequential order, but it does mean we're focussing on one million-byte slice of our cube at a time. A modern CPU has at least 4MB cache, so with a bit of luck we'll only hit main memory for any given part of the cube once in the entire program. With even better locality we could reduce the traffic in and out of L1 cache, too, but main memory is the slowest.

How much difference does it make?

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5
6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    0m8.221s
user    0m4.507s
sys     0m0.514s

Not bad. In fact, this change alone brings the original code from 100s to 20s. So this is responsible for a factor of 5, and everything else I did is responsible for another factor of 5 (I think the difference between 'user' and 'real' time in the above is mostly accounted for by the fact my virus scanner is running, which it wasn't earlier. 'user' is how much time the program occupied a CPU, 'real' includes time spent suspended, either waiting on I/O or giving another process time to run).

Of course, my bucket sort relies on the fact that whatever we're doing with the values in each column is commutative and associative. Reducing the number of buckets only worked because large values are all treated the same. This might not be true for all your operations, so you'll have to look at the inner loop of each one in turn to figure out what to do with it.

And the code is a bit more complicated. Instead of running over the data doing "blah" for each stage, we're computing all the stages at the same time in a single run over the data. If you start doing row and column computations in a single pass, as I recommended in my first answer, this will get worse. You may have to start breaking your code into functions to keep it readable.

Finally, a lot of my performance gain came from an optimisation for the fact that "steps" is small. With steps=100, I get:

$ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5
6; do diff -q results$n.bin gold$n.bin; done
g++  -O3 -pedantic -Wall   big.cpp   -o big

real    0m22.262s
user    0m10.108s
sys     0m1.029s

This isn't so bad. With steps=100 the original code probably takes about 1400 seconds, although I'm not going to run it to prove that. But it's worth remembering that I haven't completely taken away the time dependency on "steps", just made it sub-linear.

Steve Jessop
I read over it quickly and don't quite understand. Give me a day or so and I'll sit down and go over it very carefully. I will not use any code that I don't fully understand, and even then, I will not copy and paste code into my programs. Your factor of 5 time reduction is interesting. I'll need to do some research on computer structure and things like that. If I do end up using the concepts you explained to me, I will definitely give you credit for it. Thanks for the time and effort you put into this, it is much appreciated.
Faken
Haha! Over 1 month later, but I have never forgotten about your post. I finally understand. It wasn't until I had gotten much more programming experience and knowledge about modern CPUs that I could actually understand this. I'll implement my own version of what you have here when I have some time. The entire problem is not about multithreading, it's all about getting cache hits! I don't need more clock cycles, I need more memory bandwidth, and the only way to get that is to utilize the cache!
Faken
Thanks for that comment - I'll bear in mind in future that new C++ programmers will need explanations closer to first principles.
Steve Jessop
A: 

Try this code:

int dim = 1000;
int steps = 7; // ranges from 1 to 255

for (int stage = 1; stage < steps; stage++)
{
    // clear this stage's projection before accumulating into it
    for (int n = 0; n < dim * dim; n++)
        projection[n] = 0;

    for (int k = 0; k < dim; k++)
        for (int i = 0; i < dim; i++)
            for (int j = 0; j < dim; j++)       // j innermost: sequential reads
                if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                    projection[(i * dim) + j]++; // note: i and j swapped vs. the original index
}

transpose(projection);

I changed the order of the loops to make the code cache-friendly... You would gain an order-of-magnitude performance boost with it, for sure.

This is the step you should take before you move on to multithreading.

Artyom
But with this method, won't I run into problems of using even more RAM bandwidth than before? Before, I would have 1 billion RAM read operations (reads from partMap) and 1 million RAM write operations (writes to projection). But with this new method I would have 2 billion read operations (one read from partMap, then another read from projection) and 1 billion write operations (to projection); I don't understand how that could be better.
Faken
The difference is simple: you read the memory in sequential order. Every modern CPU has "prefetch" ability, so reading and writing memory sequentially is much faster than random access, which causes a cache miss on every step (a cache miss costs hundreds of cycles). Just make a simple test run and you will see that the speed of your program improves by an order of magnitude.
Artyom
+1  A: 

Is it a matrix problem?

Both Intel and AMD have super-optimized libraries for all sorts of heavy math problems. These libraries use threading, arrange the data for best cache use, do cache prefetching, and use SSE vector instructions. Everything.

I believe you have to pay for the libraries, but they are well worth the money.

Zan Lynx
It is not a matrix problem. It's actually my BS'ed attempt at handling 3D data in a form that I can understand. I have only about 1 month's worth of C++ programming experience, and besides, I'm a mechanical engineer, not comp sci. I got the idea of how to handle 3D data in my program from working with FEA and CFD programs; depending on the settings and the program, they do something very similar.
Faken