Hello :)

I am interposing the memcpy() function in C because the target application uses it to concatenate strings and I want to find out which strings are being created. The code is:

void * my_memcpy ( void * destination, const void * source, size_t num )
{
    void *ret = memcpy(destination, source, num);
    // printf ("[MEMCPY] = %s \n", ret);
    return ret;
}

The function gets called successfully, but the first parameter can be anything, and I only want to trace it if the result is a string or an array, so I would have to check which it is. I know this can't be done in a straightforward way: is there any way to find out what ret points to?

I am working on Mac OS X and interposing with dyld.

Thank you very much.

A: 

ret is equal to the destination pointer. But it's not possible to determine whether it's an array or a string, unless you know more information about the array or string (for instance, that the string is of a certain length and is null-terminated).
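
The one piece of extra information an interposer does have is num. As a minimal sketch (the helper name terminated_within is hypothetical, not part of the original code), you can at least check whether the copied region contains a terminating '\0' without reading past the bytes that were copied:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the string length in *len if the
   first num bytes at data contain a '\0'; returns 0 otherwise.
   It never reads beyond data[num - 1]. */
static int terminated_within(const void *data, size_t num, size_t *len)
{
    const char *end;

    if (data == NULL || num == 0)
        return 0;
    end = memchr(data, '\0', num);
    if (end == NULL)
        return 0; /* no terminator inside the copied region */
    *len = (size_t)(end - (const char *)data);
    return 1;
}
```

A hit only says the bytes are null-terminated, not that they were meant as text, so this narrows the question rather than answering it.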

Reinderien
+1  A: 

As void* represents a raw block of memory, there is no way to determine what actual data lies there.

However, you can make a "string-like" memory dump on every operation; just cap the output at some upper limit.

This could be implemented the following way:

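#include <cstddef>
#include <iostream>

const std::size_t kUpperLimit = 32;

// Writes exactly kUpperLimit raw bytes starting at memory, whatever they contain.
void output_memory_dump(void* memory) {
   std::cout.write(reinterpret_cast<char*>(memory), kUpperLimit);
}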

For non-string like data the output would be hardly interpretable, but otherwise you'd get what you were searching for.
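
For a pure C interposer, the same idea can be sketched with stdio. This is only an illustration under assumptions: the name dump_copied_bytes and the extra num parameter are hypothetical, and the dump is clamped to num so it never reads past the bytes that memcpy actually copied.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define DUMP_LIMIT ((size_t)32) /* same cap as kUpperLimit above */

/* Hypothetical C variant: writes at most DUMP_LIMIT bytes, and never more
   than num, to out. Non-printable bytes are shown as '.'.
   Returns the number of bytes dumped. */
static size_t dump_copied_bytes(FILE *out, const void *memory, size_t num)
{
    const unsigned char *p = memory;
    size_t n = num < DUMP_LIMIT ? num : DUMP_LIMIT;
    size_t i;

    for (i = 0; i < n; i++)
        fputc(isprint(p[i]) ? p[i] : '.', out);
    fputc('\n', out);
    return n;
}
```

Inside my_memcpy this would be called as dump_copied_bytes(stderr, ret, num). Note that stdio may itself copy memory internally, so a real interposer may need a re-entrancy guard.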

You could attempt a guess-based approach, such as iterating through reinterpret_cast<char*>(memory) and checking whether each byte is alphanumeric or whitespace, but this approach doesn't seem very reliable (who knows what could actually lie behind that void*...).

Anyway, for some situations that might be fine.
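
A rough sketch of such a guess in C, assuming you pass in the num argument of memcpy. The function name mostly_text and the threshold parameter are made up for illustration; the right threshold depends on the application being traced.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical guess: returns 1 if at least min_fraction of the n bytes are
   alphanumeric, whitespace, or punctuation; returns 0 otherwise. */
static int mostly_text(const void *memory, size_t n, double min_fraction)
{
    const unsigned char *p = memory;
    size_t i, count = 0;

    if (n == 0)
        return 0; /* nothing copied, nothing to guess */
    for (i = 0; i < n; i++) {
        if (isalnum(p[i]) || isspace(p[i]) || ispunct(p[i]))
            count++;
    }
    return (double)count / (double)n >= min_fraction;
}
```

For example, mostly_text(ret, num, 0.9) accepts a buffer in which nine out of ten bytes look like text. Binary data can still pass by accident, which is why this is only a guess.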

HardCoder1986
What if the memory region is smaller than 32 bytes? Or smaller than 5? How do you know what upper limit to use? Well, at least you can use `num` for that, but this number might be less than the actual memory region size.
Vlad Lazarenko
@Vlad Ok, we'll have a dump of 5 actual bytes and everything that goes after it. I'm almost sure this can't lead to access faults in debug mode and so on. *It's also obvious, that you can't have a "good-enough" approach if you only have `void*` pointer and nothing else, but why not give it a try...*
HardCoder1986
@HardCoder1986: Oh yes it can SEGFAULT your program easily due to accessing protected memory.
Vlad Lazarenko
@Vlad Oh, I doubt that. *Speaking this way, debuggers that receive pointer to non-null-terminated char sequence should also segfault.*
HardCoder1986
You can try peeking into the heap to find the size of the allocated block. Try checking `((long *)source)[-1]` and see what the value is.
TMN
@TMN Nice idea, but that would also work only if descriptor for the allocated block lies AT THE BEGINNING, not at the end. However, this is not true for some allocators.
HardCoder1986
Thanks a lot for your answers. There must indeed be a way, because strace tools are able to print any type of parameter to the screen. Consider this: strace /usr/games/fortune. Output:
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7832000
_llseek(16, 0, [0], SEEK_CUR) = 0
_llseek(16, 20480, [20480], SEEK_SET) = 0
and so on. I will try the hardcoder way :)
flaab
I'm going nuts to compile this thing lol
flaab
@flaab Need any help?
HardCoder1986
@HardCoder1986, yes I do :) I have included iostream, the kUpperLimit constant and the function you just posted. I call it from my memcpy function. I am compiling with the following instruction: [gcc -Wall -dynamiclib -o libinterposers.dylib libinterposers.cpp]. And I get the following error message:
flaab
Undefined symbols:
  "___gxx_personality_v0", referenced from:
      ___gxx_personality_v0$non_lazy_ptr in cc5KdVvb.o
  "std::basic_ostream<char, std::char_traits<char> >::write(char const*, int)", referenced from:
      output_memory_dump(void*) in cc5KdVvb.o
  "std::ios_base::Init::~Init()", referenced from:
      ___tcf_0 in cc5KdVvb.o
  "std::ios_base::Init::Init()", referenced from:
      __static_initialization_and_destruction_0(int, int) in cc5KdVvb.o
  "std::cout", referenced from:
      __ZSt4cout$non_lazy_ptr in cc5KdVvb.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
flaab
@flaab If you want to compile C++ code using `gcc`, you should link to `libstdc++` manually. Still, the best approach is to use `g++ whatever` instead.
HardCoder1986
@flaab Also, my variant of function is a C++ one. You could easily write the C variant of the same routine using `printf` or `putchar` instead of `std::cout.write`. Then, obviously if other code is pure C, you could build your project using `gcc`.
HardCoder1986
@HardCoder thanks for your answers and sorry for being such a noob. I have built the .so with g++ successfully (the same command but changing gcc for g++) but it does not work (it interposes nothing when executed). It seems reinterpret_cast is not available in ANSI C, or so I think. How can I link it to libstdc++ manually? What am I missing when using g++? Thanks a lot :!
flaab
@flaab It's hard to tell like this, because I basically don't know what you're trying to achieve. A `.so` file is a shared object file (like a DLL on Win32), and I can't tell what "it doesn't work" means here.
HardCoder1986
@flaa `reinterpret_cast<T>` is a C++ thing. If you want the same effect in C, use `(char*)value` - plain C cast.
HardCoder1986
@flaa If you have C++ code, use `g++`. If you are compiling a pure C program, use `gcc`. `g++` acts basically as a wrapper around `gcc`, that tells it that supplied files should be treated as C++ code. *So, basically, don't bother yourself with those `libstdc++` things, just use `gcc` and `g++` separately.*
HardCoder1986
@flaa You might find this link helpful: http://www.network-theory.co.uk/docs/gccintro/gccintro_54.html
HardCoder1986
@HardCoder1986 thank you very much for everything. I'll give it a try with a standard C cast, (char*)address, and let you know how it went ;-)
flaab
+1  A: 

You can first apply some heuristics to the copied memory and based on that you can decide whether you want to print it.

static int maybe_string(const void *data, size_t n) {
  const unsigned char *p;
  size_t i;

  p = data;
  for (i = 0; i < n; i++) {
    int c = p[i];
    if (c == '\n' || c == '\r' || c == '\t')
      continue;
    if (1 <= c && c < 32)
      return 0; /* unusual ASCII control character */
    if (c == '\0')
      return i > 5; /* null-terminated: a string only if more than a few characters long */
  }

  return 0; /* not null-terminated, so it isn't a string */
}

This heuristic is not perfect. For example, it fails for the following pattern:

const char *str = "hello, world";
size_t len = strlen(str);
char *buf = malloc(1024);
memcpy(buf, str, len);
buf[len] = '\0';

If you want to catch that too, you will have to change the above function.
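
One possible change, sketched under the assumption that a copy of more than five bytes with no terminator and no unusual control characters is also worth printing. The name maybe_text_fragment is hypothetical.

```c
#include <stddef.h>

/* Hypothetical variant: also accepts copies that carry no '\0' at all,
   such as memcpy(buf, str, strlen(str)). */
static int maybe_text_fragment(const void *data, size_t n)
{
    const unsigned char *p = data;
    size_t i;

    for (i = 0; i < n; i++) {
        int c = p[i];
        if (c == '\n' || c == '\r' || c == '\t')
            continue;
        if (1 <= c && c < 32)
            return 0; /* unusual ASCII control character */
        if (c == '\0')
            return i > 5; /* terminated: require more than a few characters */
    }

    return n > 5; /* unterminated, but long enough and free of control bytes */
}
```

Because the data may not be null-terminated, print it with an explicit length, for example printf("%.*s\n", (int)n, (const char *)data), rather than with a plain %s.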

Roland Illig
Considering that he is interposing the standard `memcpy` function, this will slow down a system a lot. Not to mention that memory can contain any binary data including all bytes in ASCII range which does not mean it is string.
Vlad Lazarenko
You're right: When *every* byte is in ASCII range (and not `\0`), the memory doesn't contain a string. That's what the `return 0` in the last line is for. But otherwise the original poster sounded more like doing this out of curiosity and not for production use, so the slowdown is probably acceptable. I don't think printing a few extra "strings" hurts, I just wanted to make sure the terminal doesn't get confused by accidental control characters. Therefore the check for unusual ASCII control characters.
Roland Illig
A: 

No, you cannot figure this out from a void pointer. Plus, you don't know the size of the source or destination buffer, so the heuristic approach will not work. It fails for other reasons as well: for example, binary data stored in the memory region pointed to by a void* can legitimately end with a zero byte, but that doesn't mean it is a string.
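
To illustrate the false-positive problem, here is a small sketch with made-up data: eight bytes that are meant as grayscale pixel intensities, yet are indistinguishable from a null-terminated string. The array and the helper printable_prefix are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Eight grayscale pixel intensities (hypothetical image data), not text.
   Interpreted as ASCII they happen to read "Hello!!" followed by '\0'. */
static const unsigned char pixels[8] = { 72, 101, 108, 108, 111, 33, 33, 0 };

/* Counts the printable bytes before the first '\0' or non-printable byte,
   looking at no more than n bytes. */
static size_t printable_prefix(const void *data, size_t n)
{
    const unsigned char *p = data;
    size_t i = 0;

    while (i < n && p[i] != '\0' && isprint(p[i]))
        i++;
    return i;
}
```

Here printable_prefix(pixels, sizeof pixels) reports seven printable bytes followed by a zero byte, so any byte-level heuristic would classify the pixel data as a string.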

Vlad Lazarenko