Hi. I have a simple program that first writes some native x86 instructions into a declared buffer, then sets a function pointer to that buffer and makes a call. I'm noticing a severe performance penalty, however, when the buffer is allocated on the stack (as opposed to on the heap, or even in the globals data area). I verified that the start of the instruction sequence in the buffer is on a 16-byte boundary, which I'm assuming is what the CPU requires (or at least prefers). I don't know why it should matter where in the process I execute my instructions from, but in the program below, "GOOD" executes in 4 seconds on my dual-core workstation, and "BAD" takes 6 minutes or so. Is there some kind of alignment/I-cache/prediction issue going on here? My evaluation license for VTune just ended, so I can't even run an analysis on this :(. Thanks.


#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef int (*funcPtrType)(int, int);

int foo(int a, int b) { return a + b; }

int main()
{
  // Instructions in buf are identical to what the compiler generated for "foo".
  char buf[201] = {0x55,
                   0x8b, 0xec,
                   0x8b, 0x45, 0x08,
                   0x03, 0x45, 0x0c,
                   0x5D,
                   0xc3
                  };

  int i;

  funcPtrType ptr;

#ifdef GOOD
  char* heapBuf = (char*)malloc(200);
  printf("Addr of heap buf: %x\n", (unsigned)&heapBuf[0]);
  memcpy(heapBuf, buf, 200);
  ptr = (funcPtrType)(&heapBuf[0]);
#else // BAD
  printf("Addr of local buf: %x\n", (unsigned)&buf[0]);
  ptr = (funcPtrType)(&buf[0]);
#endif

  for (i=0; i < 1000000000; i++)
    ptr(1,2);

  return 0;
}


The results from running this are:

$ cl -DGOOD ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of heap buf: 410eb0

real 0m 4.33s
user 0m 4.31s
sys 0m 0.01s
$
$
$ cl ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of local buf: 12feb0

real 6m41.19s
user 6m40.46s
sys 0m 0.03s
$

Thanks.

  • Shasank
+2  A: 

My guess is that, since you have the variable i on the stack also, when you change i in your for loop, you trash the same cache line that the code is sitting in. Put the code in the middle of your buffer somewhere (and perhaps enlarge the buffer) so as to keep it separated from the other stack variables.

Also note that executing instructions on the stack is usually the hallmark of a security vulnerability (such as a buffer overrun) being exploited.

Therefore the OS is often configured to disallow this behaviour. Virus scanners may take action against it as well. Perhaps your program is running through a security check each time it tries to access that stack page (though I'd expect the sys time field to be larger in that case).

If you want to "officially" make a memory page executable, you should probably look into VirtualProtect().

Artelius
Good thinking, +1. :-)
DigitalRoss
Thanks Artelius. I added a comment under DigitalRoss's answer. - Shasank
Shasank
What you could do if you want to check this is put a big chunk of data between `buf` and `i` (using something like `char junk[10000]` between the declarations of `buf` and `i`) which would ensure that they aren't in the same cacheline. That way you wouldn't be running into self-modifying code, which is really really really bad for performance.
Nathan Fellman
Though watch out that the compiler doesn't optimise it out :P
Artelius
+3  A: 

Stack protection for security?

As a wild guess, you could be running into an MMU-based stack protection scheme. A number of security holes were based on deliberate buffer overruns, which inject executable code onto the stack. One way to fight these is with a non-executable stack. Executing from the stack would then result in a trap into the OS, where the OS or some antivirus software may intervene.

Negative i-cache coherency interaction?

Another possibility is that using both code and data accesses to nearby addresses is defeating the CPU cache strategy. I believe x86 implements an essentially automatic code/data coherency model, which is likely to result in the invalidation of large amounts of nearby cached instructions on any memory write. You can't really fix this by changing your program to not use the stack (obviously you can move the dynamic code) because the stack is written by the machine code all the time, for example, whenever a parameter or return address is pushed for a procedure call.

CPUs these days are really fast relative to DRAM or even the outer cache levels, so anything that defeats the inner caches will be quite serious; the coherency mechanism probably involves some sort of micro-trap within the CPU implementation, followed by a hardware loop to invalidate things. It isn't something Intel or AMD would have optimized for speed, since for most programs it would never happen, and when it did it would normally happen only once, after loading a program.

DigitalRoss
+1, because great minds clearly think alike :P
Artelius
Really good thinking, guys. So, in order to prevent this from happening (note, from what you're saying, I assume this can happen even if the instructions were stored in heap), the instructions should sit in the middle of a data buffer that's spaced out in both directions by "cache-line-size" bytes, so that writes to neighboring data won't dirty the cache-line containing my instructions? Thanks again for your speedy responses. - Shasank
Shasank
I can't say for sure. Software that ran on the 8086 didn't need to worry about caches, so for backwards-compatibility reasons, Intel decided to do lots of cache management in hardware, and that's complex. I'm sure there are some experts on the subject but for the likes of you and me: test the code, see how it performs!
Artelius
Well, check my comment on top under your question .. you definitely want to single step BOTH heap and stack versions to make sure that what you think is happening is really happening. Just keeping writes a cache-block away from code may not be good enough, the coherency granularity could be page-sized or whatever the designers wanted. An interesting idea would be to allocate something pretty big on the stack .. several pages. See if being 4k bytes or 40k bytes away helps...
DigitalRoss