views: 79
answers: 3

I'm procedurally generating 128-byte blocks in which the first n bytes hold machine-language functions that I call via inline assembly. These functions aren't defined anywhere in source; they're generated at run time into pages allocated with execute permission. I want to reserve the remaining (128 - n) bytes of each block for data used within those functions, because keeping the data that close lets me shrink the memory-offset operands from 32 bits to 8 bits and may also help with caching. Caching, however, is what I'm worried about.
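
To make the layout concrete, here's a stripped-down sketch of the idea (not my real generator -- the choice of n = 64, the constants, and the Linux-style mmap call are purely illustrative, and for brevity I call the block through a function pointer rather than inline asm):

    #define _DEFAULT_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    enum { BLOCK_SIZE = 128, DATA_OFFSET = 64 };    /* n = 64, purely illustrative */

    int main(void)
    {
        /* One RWX page for brevity; the real generator would write into RW
           memory and mprotect() it to read+execute before calling into it. */
        uint8_t *block = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (block == MAP_FAILED)
            return 1;

        /* Code header (x86-64): mov rax, [rdi + 0x40] ; ret
           The data sits 64 bytes into the same block, so the displacement
           fits in 8 bits (disp8) instead of 32. */
        const uint8_t code[] = { 0x48, 0x8B, 0x47, DATA_OFFSET, 0xC3 };
        memcpy(block, code, sizeof code);

        /* Data tail: a constant the generated function reads. */
        const uint64_t value = 0x1122334455667788ULL;
        memcpy(block + DATA_OFFSET, &value, sizeof value);

        uint64_t (*fn)(void *) = (uint64_t (*)(void *))block;
        printf("%llx\n", (unsigned long long)fn(block));    /* 1122334455667788 */
        return 0;
    }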

Assuming a processor with separate data and instruction caches, how well does a typical processor of this kind handle this layout? Will it load the data that follows my instructions into the instruction cache as though it were instructions? Could that cause a significant performance penalty as the processor tries to deal with these junk, possibly invalid "instructions", given that they'll sit in close proximity for essentially every call? And will it load the data into the normal L1/L2 data caches on my first access at the head of the data segment, or will it just be confused at that point?

Edit: I should add that throughput is, obviously, what matters here. How confusing or difficult the optimization is doesn't matter in this case; the goal is simply to minimize the execution time of the code.

+2  A: 

On present-day processors, the L2 and higher caches should be fine, since they are unified anyway. The L1 caches (and closely related structures such as trace caches and micro-op caches) might be affected by this trickery, and the impact will probably vary from one microarchitecture to another. I would hope that trace/micro-op caches don't suffer a penalty from data they can't decode, but I wouldn't count on it. You'll have to try it and benchmark on the microarchitectures relevant to your application.

Edit: Are you doing things this way to minimize the size of the generated code, so that you're guaranteed to have the data in cache when you have the instructions, or for some other reason? This may be more complication than you really need. Again, benchmarks and profiling are your friends.
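
For what it's worth, a bare-bones way to compare layouts is just to time a large number of calls through each variant. A sketch only -- mixed_block_fn / split_block_fn stand in for whatever your generator actually produces:

    #define _POSIX_C_SOURCE 199309L
    #include <stdint.h>
    #include <time.h>

    typedef uint64_t (*block_fn)(void *);

    /* Wall-clock time for 'iters' calls through a generated block. */
    static double time_calls(block_fn fn, void *arg, long iters)
    {
        struct timespec t0, t1;
        volatile uint64_t sink = 0;       /* keep the calls from being optimized away */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            sink += fn(arg);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    /* Hypothetical usage:
         double mixed = time_calls(mixed_block_fn, mixed_block, 100000000L);
         double split = time_calls(split_block_fn, split_data,  100000000L);   */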

Novelocrat
I put in an edit earlier about why I'm doing this, but after rereading it, it isn't too clear. This is all about minimizing execution time, mostly through good cache/memory usage (smaller instructions, data locality). The actual amount of RAM used isn't a problem in itself, just the speed penalties it could cause. Of course I'll be benchmarking and experimenting; I was just tossing out this question to get an idea of the limitations of this harebrained approach.
Good point about trace caches. They store translated instructions. Since the data won't be executed, it doesn't generate translated instructions and doesn't end up in the trace cache.
MSalters
+1  A: 

There will be some penalty, because each block will be loaded into both the L1 instruction cache and the L1 data cache, which wastes space. How much space is wasted depends on the cache line size, and the waste probably won't be offset by the savings from the reduced instruction size. The L2 cache and those beyond it are usually unified (shared between instructions and data) and will not be affected.
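
If that duplication turns out to matter, one thing you could try (a sketch only; it assumes 64-byte cache lines and blocks aligned on a 128-byte boundary) is to pad the generated code out to a cache-line boundary, so the code bytes and the data bytes of a block never share a line while the data still stays within 8-bit displacement range:

    #include <stdint.h>
    #include <string.h>

    #define CACHE_LINE  64      /* assumed line size -- check the target CPU */
    #define BLOCK_SIZE 128

    /* Code occupies [0, CACHE_LINE), data occupies [CACHE_LINE, BLOCK_SIZE).
       With a line-aligned 128-byte block, the first line then only ever enters
       the instruction cache and the second only ever enters the data cache. */
    static void lay_out_block(uint8_t *block,
                              const uint8_t *code, size_t code_len,   /* <= CACHE_LINE */
                              const uint8_t *data, size_t data_len)   /* <= BLOCK_SIZE - CACHE_LINE */
    {
        memset(block, 0xCC, BLOCK_SIZE);        /* int3 fill for the unused padding bytes */
        memcpy(block, code, code_len);
        memcpy(block + CACHE_LINE, data, data_len);
    }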

The CPU probably won't attempt to decode the data in the blocks, since you probably have a return or unconditional branch as the last instruction. Any sane CPU will not fetch or decode instructions following this.

Jay Conrod
A: 

As the other answers note, the only performance penalty you might encounter is having the same cache line resident in both the L1 instruction cache and the L1 data cache, which wastes some space. Even that won't be a real issue, because the caches fill up based on what they actually need, and as far as I recall there is no restriction against a line being present in both caches at once.

There is one point which the other answers overlook. If you plan to modify the data that is close to the code, you're very likely to trigger Self-Modifying Code scenarios, which do incur a very heavy penalty.

Self-modifying code (SMC) detection flushes the entire pipeline back to the store instruction, on the assumption that any instruction executing speculatively beyond it may have been made incorrect by the modification. On the deep pipelines of most modern x86 processors, each such flush costs many cycles during which no instructions complete.

If you ensure that you have no stores near the code, you should be fine.
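
One way to enforce that, if it fits the design (a sketch; it assumes POSIX mprotect and page-granular allocations): once a page of blocks has been generated, drop write permission on it so the in-block data becomes read-only constants, and keep anything the functions need to write in a separate, ordinary data allocation.

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Call once per page after all of its 128-byte blocks have been generated.
       Any later store into the page now faults instead of silently hitting the
       self-modifying-code machinery. */
    static int seal_code_page(void *page)
    {
        return mprotect(page, PAGE_SIZE, PROT_READ | PROT_EXEC);
    }

    /* Anything the generated functions write lives well away from the code pages. */
    static uint64_t *alloc_mutable_state(size_t n)
    {
        return calloc(n, sizeof(uint64_t));
    }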

Nathan Fellman