views: 36
answers: 2
I'm thinking of optimizing a program by taking a linear array and writing each element to an arbitrary location (random-looking from the CPU's perspective) in another array. I am only doing simple writes and not reading the elements back.

I understand that a scattered read can be quite slow on a classical CPU, since each access causes a cache miss and thus a processor stall. But I was thinking that a scattered write could technically be fast, because the processor isn't waiting for a result and thus may not have to wait for the transaction to complete.

I am unfortunately unfamiliar with the details of classical CPU memory architecture, so there may be complications that make this approach quite slow as well.

Has anyone tried this?

(I should say that I am trying to invert a problem I have. I currently have a linear array from which I am reading arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thought is that I can invert this operation into a scattered write for a significant speed benefit.)

+2  A: 

In general you pay a high penalty for scattered writes to addresses which are not already in cache: with the usual write-allocate policy the CPU has to read in, modify, and eventually write back an entire cache line for each write, so FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (typically a couple of hundred cycles on modern CPUs), with no help from any automatic prefetch mechanism.
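To make the cost concrete, here is a minimal sketch (C++) contrasting a sequential copy with a scattered one; the 10MB size is taken from the question, and the random permutation is only a stand-in for the "arbitrary" target mapping:

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 10u * 1024 * 1024 / sizeof(float);  // ~10MB, as in the question
        std::vector<float> src(n, 1.0f), dst(n);

        // A random permutation stands in for the "arbitrary" write targets.
        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), std::size_t{0});
        std::shuffle(idx.begin(), idx.end(), std::mt19937{42});

        // Sequential write: hardware prefetch and full cache-line reuse help here.
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i];

        // Scattered write: nearly every store misses, and with write-allocate
        // a whole destination line is read in just to modify four bytes of it.
        for (std::size_t i = 0; i < n; ++i)
            dst[idx[i]] = src[i];
    }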

Paul R
Do you think that the cache-specific SSE instructions would help, specifically _mm_stream_ps in the case of float data? The MSDN documentation states that this instruction "Stores the data in a to the address p without polluting the caches." http://msdn.microsoft.com/en-us/library/78x83000(v=VS.80).aspx
mindless_developer_man
You *may* be able to tune things a little, but it would probably be better to invest that effort into re-structuring your algorithm so that it writes sequentially (or at least with reasonable locality) if at all possible.
Paul R
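For what it's worth, here is a hedged sketch of the streaming-store idea from the comment above. _mm_stream_ps writes 16 aligned bytes at once, so it only fits this problem if each scatter target is a 16-byte-aligned group of four floats; for single floats, the 4-byte _mm_stream_si32 (SSE2) is the closer fit, as below. The function and index array are hypothetical, and since scattered 4-byte streams can't fill whole write-combining buffers, any actual benefit is an open question:

    #include <emmintrin.h>  // _mm_stream_si32, _mm_sfence (SSE2)
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    void scatter_stream(const float* src, float* dst,
                        const std::size_t* idx, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            std::int32_t bits;
            std::memcpy(&bits, &src[i], sizeof bits);  // bit-cast float -> int32
            // Non-temporal store: writes around the cache instead of
            // allocating a destination line for each miss.
            _mm_stream_si32(reinterpret_cast<int*>(&dst[idx[i]]), bits);
        }
        _mm_sfence();  // make the streaming stores globally visible
    }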
+2  A: 

I must admit, this sounds kind of hardcore. But I'll take the risk and answer anyway.

Is it possible to divide the input array into pages, and read/scan each page multiple times? On every pass through a page, you only process (or output) the data that belongs to a limited number of output pages. This way you only get cache misses at the start of each input-page loop.
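A minimal sketch of what that could look like, assuming the target slot of element i is given by some index array idx[] (the names and the window parameter are illustrative, not a prescription):

    #include <cstddef>

    // Multi-pass scatter: every pass re-reads the input sequentially but only
    // performs the writes that land in the current output window, so the
    // destination region stays cache-resident.
    void scatter_blocked(const float* src, float* dst,
                         const std::size_t* idx, std::size_t n,
                         std::size_t window /* in elements; size to fit in cache */) {
        for (std::size_t lo = 0; lo < n; lo += window) {
            const std::size_t hi = lo + window;
            for (std::size_t i = 0; i < n; ++i) {      // sequential input scan
                const std::size_t j = idx[i];
                if (j >= lo && j < hi)                  // write only inside the window
                    dst[j] = src[i];
            }
        }
    }

The trade-off is that the input is re-read n/window times, but sequential re-reads are cheap compared to the cache misses they replace.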

GvS
Yeah, that sounds doable. I could divide it up into subranges and only read data within that range. What page size would you recommend? Both my input and my output data sets are 10MB in size. It may be best to separate both the input and the output into pages -- thus I would have N partitions with M passes each. I could run each pass across multiple cores at once.
mindless_developer_man
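A hedged sketch of that multi-core idea, reusing the hypothetical idx[] mapping from the sketch above: if each thread owns a disjoint output window, no two threads ever write the same destination range, so no locking is needed:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void scatter_parallel(const float* src, float* dst,
                          const std::size_t* idx, std::size_t n,
                          unsigned nthreads) {
        const std::size_t window = (n + nthreads - 1) / nthreads;
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t) {
            const std::size_t lo = t * window;
            const std::size_t hi = std::min(lo + window, n);
            pool.emplace_back([=] {
                for (std::size_t i = 0; i < n; ++i) {  // each thread scans all input
                    const std::size_t j = idx[i];
                    if (j >= lo && j < hi)              // but writes only its own window
                        dst[j] = src[i];
                }
            });
        }
        for (auto& th : pool) th.join();
    }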