ansaurus

Question

Bit fields: Set vs test-and-set (for performance)

Answer 1

A:

You could always profile, but I am pretty sure the first version is both faster and less obscure.

Viktor Sehr 2009-06-08 11:55:03

Answer 2

A:

Either approach will require that the data is loaded into the cache, so your only saving will be a difference between a read/write and a write.

I don't see how this change could make your code slower with larger data sets, so you're likely safe enough on that front.

It smells a little like a premature-optimistaion to me. (Unless your profiling has identified this as a bottleneck)

As with all things performance related the best way to be sure of the effect of a code change is to measure it. You should be able to create a large amount of test data relatively easily.

Glen 2009-06-08 11:57:05

Yes, this function is definitely a bottleneck.

finnw 2009-06-08 12:04:32

Answer 3

+1 A:

This optimization will likely not cause speed decrease when moving to a larger dataset.

Cache thrashing when reading values will be the same, branch prediction penalties will also be the same and these are the key factors to optimize for here.

Branch prediction stores history per branch instruction so it doesn't care how many instances you have as long as you branch on them with instructions at different addresses (inlined function for example). If you have a single function entity (not inlined) you will have one branch instruction for all and this will suppress branch prediction making it miss more often and increasing the penalties.

sharptooth 2009-06-08 12:02:44

Are you sure about branch prediction penalties being the same? If a lot of short-lived instances are introduced, the flag will need to be updated and flushed more often, and that is what I am concerned about.

finnw 2009-06-08 12:07:10

Answer 4

+1 A:

If you're really worried about the time performance, change the flag to a full int instead of a bitfield. Then setting it is just a write and not a read-write as with bitfields.

But as already pointed out, this smells of micro-optimization.

laalto 2009-06-08 12:02:56

it doesn't matter what the type is, it still needs to be loaded into the cache / registers before you can do anything to it

Glen 2009-06-08 12:07:30

@Glen: Writing 1 to a memory address does not need to know what there was before. The problem with bitfields is that individual bits are not addressable. The code needs to first read the byte/short/int/whatever, modify it and write it back to make sure other bits at the same address are not affected.

laalto 2009-06-08 12:10:21

@laalto, if I understand correctly, words are not individually addressable in main memory either. You have to read, modify and write back a whole cache line to update the flag. It's the write that I want to avoid.

finnw 2009-06-08 12:14:41

Answer 5

A:

The test before set doesn't make any sense - code without test is cleaner and also a bit faster.

As a side note - it makes sense to inline functions like this because overhead on function call is bigger then function body, although optimising compiler should do it without second thought.

qrdl 2009-06-08 12:03:09

Are you sure about this? I believe testing first can avoid a write back to main memory if the flag was already set. That is why I introduced the test.

finnw 2009-06-08 12:12:31

Unless you are working on some exotic architecture, reading from and writing to memory cost the same - it is usually the same MOV, just with different operands, and normally takes one cycle. Test will add another MOV and conditional jump so you end up with 2 (test positive) or 3 (test negative) instructions instead of one.

qrdl 2009-06-08 13:24:11

Answer 6

+9 A:

The test before set does make a difference, but how much it is depends on your use-cases.

The data will end up in a cache-line in either case (e.g. just writing or test-and-set).

However, there is a difference if your cache line is tagged as dirty (e.g. modified) or clean. Dirty cache-lines have to be written back to main memory while clean cache-lines can just be forgotten and filled with new data.

Now consider that your code mangles huge amounts of data, and you access each chunk of data only once or twice. If so can assume that most of the memory accesses are cache misses. What happends if the majority of your cache-lines are dirty at the point where a cache miss occurs and the majority of cache lines are dirty?

They have to be written back to the main memory before new data will be loaded into the line. This is slower than just forgetting the content of a cache-line. Also it will double the memory bandwidth between the cache and the main memory.

That may not make a difference for once CPU core since memory is fast these days, but another CPU will (hopefully) do some other work as well. You can be sure that the other CPU core will execute everything a tad faster if the buss is not busy moving cache-lines in and out.

In short: keeping your cache-lines clean will half that bandwidth requirement and makes cache-misses a tad cheaper.

Regarding the branch: Sure: It's costly, but a cache-miss is much worse! Also if you're lucky the CPU will use it's out of order execution features to offset cache misses with the costs of the branch.

If you really want to get the best possible performance out of this code, and if most of your accesses are cache-misses you have two options:

Bypass the cache: The x86 architecture has non temporal loads and stores for this purpose. They're hidden somewhere in the SSE instruction sets and can be used from the c-language via intrinsics.
(Only for experts): Use some lines of inline-assembler that replaces the test-and-set fuction with assembler that uses the CMOV (conditional move) instruction. This will not only keep your cache-lines clean but avoid the branch. Now CMOV is a slow instruction and will only outperform a branch if the branches cannot be predicted. So you'll better benchmark your code.

Nils Pipenbrinck 2009-06-08 12:24:46

This is the only answer so far that shows any real understanding of how caches work. Bravo, Nils!If the OP happens to be on IA64 or ARM, then the compiler can, and hopefully will, use a predicated instruction for the store, rather than a branch, and so there will be no branch prediction issues at all.

Tom Anderson 2009-06-08 13:18:00

Answer 7

+2 A:

These are my interpretations of your requirement,

you have the flag initialized separately
it is set only once (to 1) and not reset after that
But, this set attempt will be made many times on the same flag
And, you have a lot these flag instances (each needing the same kind of processing)

Assuming that,

space optimization is weighted quite lower than time optimization,

I suggest the following things.

Firstly, on 32-bit systems it helps to use 32-bit integers if you are worried about access times
If you skip a check on the flag 'word', the write will be quite fast. But, given that you have a very large number of flags which you will keep checking and set if not already set, it would be better to keep the conditional check in.
But, having said that, if your platform does parallel operations, (for example, a write to the disk can be sent off in parallel to your code execution usually) it would be worthwhile skipping the check.

nik 2009-06-08 12:26:57

Answer 8

A:

Since nobody else said it, I will.

Why are you using a bit-field at all? The layout will differ from compiler to compiler, so they are useless for interfaces. They may, or may not, be more space efficient; the compiler might just decide to shove them into a 32-bit field so as to pad thing efficiently. There are no guarantees they are faster, and in fact they are likely to be slower.

I've banned their usage at work. Unless someone can give me a convincing reason that they provide any extra capability, it's not worth playing with them.

Chris Arguin 2009-06-08 17:08:26

Plenty of good reasons for bit fields. In the case of this poster, a bit field can be used to make better use of memory or memory bandwidth. Putting more information in a single word can mean many fewer round trips to main memory.

TokenMacGuy 2009-06-08 17:38:31

Answer 9

+2 A:

Nate Kohl 2009-06-08 17:25:26

Thats interesting actually.

Viktor Sehr 2009-09-12 11:06:12

ansaurus

tags:

views:

answers:

Bit fields: Set vs test-and-set (for performance)

related questions