Hi. I have a simple program that first writes some native x86 instructions into a declared buffer, then sets a function pointer to that buffer and makes a call. I'm noticing a severe performance penalty, however, when the buffer is allocated on the stack (as opposed to on the heap, or even in the globals data area). I verified that the start of the instruction sequence in the buffer is on a 16-byte boundary, which I'm assuming is what the CPU requires (or at least prefers). I don't know why it should matter where in the process I execute my instructions from, but in the program below, "GOOD" executes in 4 seconds on my dual-core workstation, and "BAD" takes 6 minutes or so. Is there some kind of alignment/I-cache/prediction issue going on here? My evaluation license for VTune just ended, so I can't even run an analysis on this :(. Thanks.


#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef int (*funcPtrType)(int, int);

int foo(int a, int b) { return a + b; }

int main()
{
  // Instructions in buf are identical to what the compiler generated for "foo".
  char buf[201] = {0x55,
                   0x8b, 0xec,
                   0x8b, 0x45, 0x08,
                   0x03, 0x45, 0x0c,
                   0x5D,
                   0xc3
                  };

  int i;

  funcPtrType ptr;

#ifdef GOOD
  char* heapBuf = (char*)malloc(200);
  printf("Addr of heap buf: %x\n", (unsigned)&heapBuf[0]);
  memcpy(heapBuf, buf, 200);
  ptr = (funcPtrType)(&heapBuf[0]);
#else // BAD
  printf("Addr of local buf: %x\n", (unsigned)&buf[0]);
  ptr = (funcPtrType)(&buf[0]);
#endif

  for (i=0; i < 1000000000; i++)
    ptr(1,2);

  return 0;
}


The results from running this are:

$ cl -DGOOD ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of heap buf: 410eb0

real 0m 4.33s
user 0m 4.31s
sys 0m 0.01s
$
$
$ cl ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of local buf: 12feb0

real 6m41.19s
user 6m40.46s
sys 0m 0.03s
$

Thanks.

  • Shasank
+2  A: 

My guess is that, since you have the variable i on the stack also, when you change i in your for loop, you trash the same cache line that the code is sitting in. Put the code in the middle of your buffer somewhere (and perhaps enlarge the buffer) so as to keep it separated from the other stack variables.

Also note that executing instructions on the stack is usually the hallmark of a security vulnerability (such as a buffer overrun) being exploited.

Therefore the OS is often configured to disallow this behaviour. Virus scanners may take action against it as well. Perhaps your program is running through a security check each time it tries to access that stack page (though I'd expect the sys time field to be larger in that case).

If you want to "officially" make a memory page executable, you should probably look into VirtualProtect().

Artelius
Good thinking, +1. :-)
DigitalRoss
Thanks Artelius. I added a comment under DigitalRoss's answer. - Shasank
Shasank
What you could do if you want to check this is put a big chunk of data between `buf` and `i` (using something like `char junk[10000]` between the declarations of `buf` and `i`) which would ensure that they aren't in the same cacheline. That way you wouldn't be running into self-modifying code, which is really really really bad for performance.
Nathan Fellman
Though watch out that the compiler doesn't optimise it out :P
Artelius
+3  A: 

Stack protection for security?

As a wild guess, you could be running into an MMU-based stack protection scheme. A number of security holes were based on deliberate buffer overruns, which inject executable code onto the stack. One way to fight these is with a non-executable stack. Executing from the stack would then result in a trap into the OS, where the OS or some antivirus software may intervene.

Negative i-cache coherency interaction?

Another possibility is that using both code and data accesses to nearby addresses is defeating the CPU cache strategy. I believe x86 implements an essentially automatic code/data coherency model, which is likely to result in the invalidation of large amounts of nearby cached instructions on any memory write. You can't really fix this by changing your program to not use the stack (obviously you can move the dynamic code) because the stack is written by the machine code all the time, for example, whenever a parameter or return address is pushed for a procedure call.

CPUs these days are really fast relative to DRAM or even the outer cache levels, so anything that defeats the inner caches will be quite serious; the coherency mechanism probably involves some sort of micro-trap within the CPU implementation, followed by a hardware loop to invalidate things. It isn't something Intel or AMD would have optimized for speed, since for most programs it would never happen, and when it did it would normally happen only once, after loading a program.

DigitalRoss
+1, because great minds clearly think alike :P
Artelius
Really good thinking, guys. So, in order to prevent this from happening (note, from what you're saying, I assume this can happen even if the instructions were stored in heap), the instructions should sit in the middle of a data buffer that's spaced out in both directions by "cache-line-size" bytes, so that writes to neighboring data won't dirty the cache-line containing my instructions? Thanks again for your speedy responses. - Shasank
Shasank
I can't say for sure. Software that ran on the 8086 didn't need to worry about caches, so for backwards-compatibility reasons, Intel decided to do lots of cache management in hardware, and that's complex. I'm sure there are some experts on the subject but for the likes of you and me: test the code, see how it performs!
Artelius
Well, check my comment on top under your question .. you definitely want to single step BOTH heap and stack versions to make sure that what you think is happening is really happening. Just keeping writes a cache-block away from code may not be good enough, the coherency granularity could be page-sized or whatever the designers wanted. An interesting idea would be to allocate something pretty big on the stack .. several pages. See if being 4k bytes or 40k bytes away helps...
DigitalRoss