On a 32-bit CPU, an integer is 4 bytes and a short integer is 2 bytes. If I am writing a C/C++ application that uses many numeric values that will always fit within the range of a short integer, is it more efficient to use 4-byte integers or 2-byte integers?

I have heard it suggested that 4-byte integers are more efficient because they match the width of the bus from memory to the CPU. However, if I am adding together two short integers, would the CPU package both values in a single pass in parallel (thus spanning the 4-byte bandwidth of the bus)?

+6  A: 

Yes, you should definitely use a 32-bit integer on a 32-bit CPU; otherwise it may end up masking off the unused bits (i.e., it will always do the maths in 32 bits and then convert the answer to 16 bits).

It won't do two 16-bit operations at once for you, but if you're sure your values won't overflow, you can pack them and do it yourself.
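
Roughly, the packing trick looks like this; a minimal sketch, assuming neither 16-bit lane can produce a carry (a carry out of the low half would corrupt the high half):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint16_t a0 = 100, a1 = 200;   // first pair of values
        uint16_t b0 = 30,  b1 = 40;    // second pair of values

        // Pack each pair into one 32-bit word: high lane | low lane.
        uint32_t a = (static_cast<uint32_t>(a1) << 16) | a0;
        uint32_t b = (static_cast<uint32_t>(b1) << 16) | b0;

        // One 32-bit addition performs both 16-bit additions, provided
        // neither lane overflows into the one above it.
        uint32_t sum = a + b;

        printf("%u %u\n", static_cast<unsigned>(sum & 0xFFFF),
                          static_cast<unsigned>(sum >> 16));   // prints: 130 240
        return 0;
    }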

Edit: I should add that it also depends somewhat on your definition of "efficient". While it will be able to do 32-bit operations more quickly, you will of course use twice as much memory.

If these are being used for intermediate calculations in an inner loop somewhere, then use 32-bit. If, however, you're reading this from disk, or even if you just have to pay for a cache miss, it may still work out better to use 16-bit integers. As with all optimizations, there's only one way to know: profile it.

MrZebra
It should be noted that stdint.h in C99 has int_fastN_t and uint_fastN_t types, where N is 8/16/32/64 (though not all are always available). Boost has an equivalent for C++, and g++ also tends to include stdint.h. These are supposed to be the fastest types that are at least the required size.
Evan Teran
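
For example, the fast types Evan mentions look like this in use (a sketch; which width you actually get is up to the platform):

    #include <stdint.h>
    #include <stddef.h>

    // int_fast16_t is the fastest type at least 16 bits wide: often 32
    // or 64 bits on x86, and exactly 16 only where that is faster.
    int_fast16_t sum(const int16_t* data, size_t n) {
        int_fast16_t total = 0;  // may still be only 16 bits on some targets,
                                 // so this can overflow for large sums
        for (size_t i = 0; i < n; ++i)
            total += data[i];
        return total;
    }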
A: 

Duplicate question. See .NET Integer vs Int16? (It's labeled .NET, but it applies just the same, since it is about hardware architecture.)

Mufasa
+5  A: 

If you have a large array of numbers, go with the smallest size that works. It will be more efficient to work with an array of 16-bit shorts than 32-bit ints, since you get twice the cache density. The cost of any sign extension the CPU has to do to work with 16-bit values in 32-bit registers is negligible compared to the cost of a cache miss.

If you are simply using member variables in classes mixed with other data types, it is less clear cut: padding requirements will likely remove any space-saving benefit of the 16-bit values.
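
For example (a sketch; exact sizes depend on the compiler and ABI, but this layout is typical):

    #include <cstdint>
    #include <cstdio>

    struct Mixed {
        int16_t a;   // 2 bytes...
        int32_t b;   // ...then 2 padding bytes so 'b' lands on a 4-byte boundary
    };

    struct AllShorts {
        int16_t a;
        int16_t b;   // no padding needed: both members are 2-byte aligned
    };

    int main() {
        // On a typical ABI this prints 8 and 4: the lone 16-bit member
        // saved nothing, because padding ate the difference.
        printf("%zu %zu\n", sizeof(Mixed), sizeof(AllShorts));
        return 0;
    }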

Rob Walker
+2  A: 

It depends. If you are CPU bound, 32-bit operations on a 32-bit CPU will be faster than 16-bit ones. If you are memory bound (specifically, if you have too many L2 cache misses), then use the smallest data type your values can squeeze into.

You can find out which one you are with a profiler that measures both CPU cycles and L2 misses, such as Intel's VTune. Run your app twice with the same load, and it will merge the two runs into one view of the hotspots in your app, showing how many cycles were spent on each line of code. If an expensive line of code shows zero cache misses, you are CPU bound. If you see tons of misses, you are memory bound.

Jurney
+1  A: 

If you are operating on a large dataset, the biggest concern is memory footprint. A good model in this case is to assume that the CPU is infinitely fast, and spend your time worrying about how much data has to be moved to/from memory. In fact, CPUs are now so fast that it is sometimes more efficient to encode (e.g., compress) the data. That way, the CPU does (potentially much) more work (decoding/coding), but the memory bandwidth is substantially reduced.

Thus, if your dataset is large, you are probably better off using 16-bit integers. If your list is sorted, you might design a coding scheme that involves differential or run-length encoding, which will reduce memory bandwidth even more.
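
A sketch of the differential idea, assuming consecutive sorted values are never more than 65535 apart (the delta_encode function below is illustrative, not from any library):

    #include <cstdint>
    #include <vector>

    // Store the gaps between consecutive sorted values instead of the
    // values themselves; the gaps usually fit in far fewer bits.
    std::vector<uint16_t> delta_encode(const std::vector<uint32_t>& sorted) {
        std::vector<uint16_t> gaps;
        gaps.reserve(sorted.size());
        uint32_t prev = 0;
        for (uint32_t v : sorted) {
            gaps.push_back(static_cast<uint16_t>(v - prev));  // assumes gap <= 65535
            prev = v;
        }
        return gaps;
    }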

+4  A: 

If you're using "many" integer values, the bottleneck in your processing is liable to be memory bandwidth. 16-bit integers pack more tightly into the data cache, and would therefore be a performance win.

If you are number crunching on a very large amount of data, you should read What Every Programmer Should Know About Memory by Ulrich Drepper. Concentrate on chapter 6, about maximizing the efficiency of the data cache.

DGentry
+1 for the link to the memory paper! wow!
Jason S
A: 

When you say 32-bit, I'll assume you mean x86. 16-bit arithmetic is quite slow there: the operand-size prefix makes decoding really slow. So don't make your temp variables short int or int16_t.

However, x86 can efficiently load 16- and 8-bit integers into 32- or 64-bit registers (movzx/movsx: zero and sign extension). So feel free to use short int for arrays and struct fields, but make sure you use int or long for your temp variables.
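
In other words, something like this (a sketch; a compiler targeting x86 will typically turn each 16-bit load into a single movsx):

    #include <cstdint>
    #include <cstddef>

    // Narrow array for cache density, full-width temporary for the math:
    // each element is sign-extended as it is loaded, and the addition
    // itself runs at register width with no operand-size prefix.
    int sum(const int16_t* data, size_t n) {
        int total = 0;            // full-width temp, per the advice above
        for (size_t i = 0; i < n; ++i)
            total += data[i];     // assumes the sum fits in an int
        return total;
    }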

However, if I am adding together two short integers, would the CPU package both values in a single pass in parallel (thus spanning the 4-byte bandwidth of the bus)?

That is nonsense. Load/store instructions interact with the L1 cache, and the limiting factor is the number of operations, not their width. For example, on Core 2 you get one load and one store per cycle regardless of width, and the L1 cache has a 128- or 256-bit path to the L2 cache.

If loads are your bottleneck, one wide load that you then split up with shifts or masks can help. Or use SIMD to process the data in parallel without ever unpacking it to a wider type.
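
For instance, with SSE2 intrinsics (a sketch assuming n is a multiple of 8):

    #include <emmintrin.h>   // SSE2
    #include <cstdint>
    #include <cstddef>

    // Eight 16-bit additions per instruction, straight from memory,
    // with no unpacking to 32 bits at any point.
    void add_shorts(const int16_t* a, const int16_t* b, int16_t* out, size_t n) {
        for (size_t i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                             _mm_add_epi16(va, vb));
        }
    }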

Peter Cordes
A: 

Don't listen to the advice, try it.

This is probably going to depend heavily on the hardware and compiler you're using, so a quick test should make short work of the question. It probably takes less time to write the test than it took to write the question here.
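
Something as simple as this will do (a sketch; the numbers will vary wildly with machine, data size, and compiler flags):

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Time the same summation over 16-bit and over 32-bit elements.
    template <typename T>
    double time_sum_ms(size_t n) {
        std::vector<T> v(n, 1);
        auto start = std::chrono::steady_clock::now();
        volatile long long sink = std::accumulate(v.begin(), v.end(), 0LL);
        (void)sink;  // keep the result alive so the loop isn't optimized away
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(stop - start).count();
    }

    int main() {
        const size_t n = 50000000;  // large enough to spill out of the caches
        printf("int16_t: %.1f ms\n", time_sum_ms<int16_t>(n));
        printf("int32_t: %.1f ms\n", time_sum_ms<int32_t>(n));
        return 0;
    }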

ablerman