Edit - I guess the question I asked was too long so I'm making it very specific.

Question: Suppose a memory location is in the L1 cache, is not marked dirty, and holds a value X. What happens if you try to write X to the same location? Is there any CPU that would detect that such a write is redundant and skip it?

For example, is there an optimization that compares the two values and discards a redundant write-back to main memory? Specifically, how do mainstream processors handle this? What about when the value is a special value like 0? If there is no such optimization even for a special value like 0, is there a reason?

Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling it amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply means setting all values to 0. Each time we recycle, size - n of the locations are already 0. Intuitively, it seems that avoiding so many redundant write-backs would make the recycling process faster, hence the question.

Doing this check in code wouldn't make sense, since the comparison itself has to read the old value and might cause an unnecessary cache miss (if (buf[i]) { ... }).

A: 

This is not an answer to your original question on computer architecture, but it might be relevant to your goal.

In this discussion, all array indices start at zero.


Assuming n is much smaller than size, change your algorithm so that it saves two pieces of information:

  1. An array of length size (the full-size array)
  2. An array of length n and a counter, used together to emulate a set container; duplicate values are allowed

Every time a non-zero value is written to index k in the full-size array, insert the value k into the set container.

When the full-size array needs to be cleared, take each value stored in the set container (which will contain k, among others) and set the corresponding index in the full-size array to zero.
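A minimal sketch of that idea in C++ follows; the struct name, the use of std::vector, and the int element type are illustrative choices, not something from the original question:

    #include <cstddef>
    #include <vector>

    // Sketch of the "set container" idea: record every index that receives a
    // non-zero value, then clear only those indices on recycle.
    struct DirtyTrackedBuffer {
        std::vector<int> data;          // the full-size array
        std::vector<std::size_t> dirty; // emulated set: array of indices (its size is the counter)

        explicit DirtyTrackedBuffer(std::size_t size) : data(size, 0) {}

        void write(std::size_t k, int value) {
            if (value != 0) dirty.push_back(k); // remember which slots became non-zero
            data[k] = value;
        }

        void recycle() {
            for (std::size_t k : dirty) data[k] = 0; // touch only the recorded slots
            dirty.clear();
        }
    };

Since a duplicate entry in the dirty list only causes a slot to be zeroed twice, no deduplication is needed; recycle() touches roughly n locations instead of rewriting the whole buffer.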


A similar technique, known as a two-level histogram or radix histogram, can also be used.

Two pieces of information are stored:

  1. An array of length size (the full-size array)
  2. A boolean array of length ceil(size / M), where M is the radix and ceil is the ceiling function

Every time a non-zero value is written to index k in the full-size array, the element floor(k / M) in the boolean array should be marked.

Let's say bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M - 1 in the full-size array.

When the full-size array needs to be cleared, scan the boolean array for marked elements and clear the corresponding range in the full-size array for each one.
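Here is a correspondingly hedged sketch of the two-level variant; the block size M = 64 and the names are again illustrative, not prescribed by the answer:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Two-level (radix) tracking: one boolean flag per block of M elements.
    struct RadixTrackedBuffer {
        static constexpr std::size_t M = 64;   // radix (block size), arbitrary choice here
        std::vector<int> data;                 // the full-size array
        std::vector<bool> block_dirty;         // ceil(size / M) flags

        explicit RadixTrackedBuffer(std::size_t size)
            : data(size, 0), block_dirty((size + M - 1) / M, false) {}

        void write(std::size_t k, int value) {
            if (value != 0) block_dirty[k / M] = true; // mark block floor(k / M)
            data[k] = value;
        }

        void recycle() {
            for (std::size_t j = 0; j < block_dirty.size(); ++j) {
                if (!block_dirty[j]) continue;
                std::size_t lo = j * M;                          // start of range j*M
                std::size_t hi = std::min(lo + M, data.size());  // one past (j+1)*M - 1
                std::fill(data.begin() + lo, data.begin() + hi, 0);
                block_dirty[j] = false;
            }
        }
    };

Compared with the per-index list, this bounds the bookkeeping at ceil(size / M) flags at the cost of clearing whole blocks even when only one element in a block was written.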

rwong
I'm not looking for a technique in software - in software I can't know whether something is in the cache or not.
hawk
@hawk: My technique reduces the total number of writes needed to implement the same algorithm. So I thought it may be useful to you in a way that does not depend on a specific computer architecture.
rwong
A: 

Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:

  1. The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
  2. The old and new values are compared.
  3. If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.

Step 1 may actually take longer than steps 2 and 3, because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it were implemented in software.

Consider what happens if we simply write the new value to the cache without checking the old value. This is actually faster than the three-step process above, for two reasons. First, there is no need to wait for the old value. Second, the CPU can simply schedule the write operation in an output (store) buffer, which can perform the cache write while the ALU starts working on something else.

So far, the only latencies involved are those between the CPU and the cache, not between the cache and main memory.
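Expressed in software terms, the two strategies look like the sketch below; it only illustrates the data dependency argument, and the function names are made up for this example:

    #include <cstddef>

    // Check-then-write: each iteration must first load buf[i]; the compare and
    // the (possible) store cannot start until that load completes.
    void clear_checked(int* buf, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (buf[i] != 0)   // load + compare, dependent on the read
                buf[i] = 0;    // store only when the old value differed
        }
    }

    // Blind write: no dependency on the old value; the stores can sit in the
    // write buffer while the rest of the loop continues.
    void clear_blind(int* buf, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            buf[i] = 0;
    }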


The situation is more complicated in modern-day microprocessors, because their caches are organized into cache lines. When a byte value is written to a cache line, the complete cache line has to be loaded first, because the parts of the cache line that are not being rewritten have to keep their old values.

http://blogs.amd.com/developer/tag/sse4a/

Read
  • Cache hit: Data is read from the cache line to the target register
  • Cache miss: Data is moved from memory to the cache, and read into the target register
Write
  • Cache hit: Data is moved from the register to the cache line
  • Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line
rwong
You do realize that on a multi-proc machine there needs to be a step 4 - evicting the value all the way to main memory. That step would take anywhere from 10-200 cycles. I imagine it might be worth it to save that extra step sometimes.
hawk
@hawk: You already know more than I do. Please read http://www.drdobbs.com/architecture-and-design/208200273. Maybe you can consider asking Herb Sutter directly if he is aware of such optimizations.
rwong