It used to be the case that if you needed to make a system call directly in Linux without the use of an existing library, you could just include <linux/unistd.h> and it would define a macro similar to this:

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name(type1 arg1,type2 arg2,type3 arg3) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
  : "=a" (__res) \
  : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
      "d" ((long)(arg3))); \
if (__res>=0) \
  return (type) __res; \
errno=-__res; \
return -1; \
}

Then you could just put somewhere in your code:

_syscall3(ssize_t, write, int, fd, const void *, buf, size_t, count);

which would define a write function for you that properly performed the system call.

It seems that this system has been superseded by something more robust (I am guessing the "[vsyscall]" page that every process gets).

So what is the proper way (please be specific) for a program to perform a system call directly on newer Linux kernels? I realize that I should be using libc and let it do the work for me. But let's assume that I have a decent reason for wanting to know how to do this :-).

+1  A: 

OK, so I looked into it further since I didn't get much of a response here, and found some good information. First, when an application is launched in Linux, in addition to the traditional argc, argv, and envp parameters, another array called auxv is passed with some more data. See here for details.

One of these key/value pairs has the key AT_SYSINFO, which is defined in either /usr/include/asm/auxvec.h or /usr/include/elf.

The value associated with this key is the entry point to the system call function (in the "vdso" or "vsyscall" page mapped into every process).
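As a rough sketch of the lookup described above: on Linux the auxv entries sit directly after the NULL that terminates envp on the initial stack, so they can be found by walking past it. The function name auxv_lookup is hypothetical, and the sketch assumes the envp it is given still points at the original stack (that is, the environment has not been reallocated by setenv or putenv).

```c
#include <elf.h>    /* AT_NULL, AT_PAGESZ, AT_SYSINFO, AT_SYSINFO_EHDR */
#include <link.h>   /* ElfW() selects the Elf32_ or Elf64_ types for this build */
#include <stddef.h>

/* For callers that do not have main's third argument at hand. */
extern char **environ;

/* Hypothetical helper: return the value stored for `key` in the auxiliary
 * vector, or 0 if the key is absent.
 * Assumption: envp points into the initial process stack. */
unsigned long auxv_lookup(char **envp, unsigned long key)
{
    /* Skip the environment strings up to their NULL terminator. */
    while (*envp != NULL)
        envp++;

    /* The auxiliary vector starts right after that NULL. */
    const ElfW(auxv_t) *entry = (const ElfW(auxv_t) *)(envp + 1);

    for (; entry->a_type != AT_NULL; entry++) {
        if (entry->a_type == key)
            return entry->a_un.a_val;
    }
    return 0;
}
```

Calling auxv_lookup(environ, AT_SYSINFO) on a 32-bit x86 process would be expected to return the entry point discussed here; a result of 0 means the kernel did not supply that key.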

You could just replace the traditional int 0x80 or syscall instructions with a call to this address and it would actually do the system call. Unfortunately, this is ugly, so the libc folks came up with a nice solution. When they allocate the TCB and assign it to the gs segment, they put the value of AT_SYSINFO at some fixed offset in the TCB (unfortunately it isn't fixed across versions, so you can't rely on the offset always being the same constant). So instead of a traditional int 0x80 you can just say call *%gs:0x10, which will call the system call routine found in the vdso section.

I suppose the goal here is to make writing libc easier. This allows the libc guys to write one block of code to deal with system calls and not have to worry about it ever again. The kernel guys can change how system calls are done at any point in time, they just need to change the contents of the vdso page to use the new mechanism and it's good to go. In fact, you wouldn't need to even recompile your libc! However, this does make things a pain in the butt for us people writing inline assembly and trying to play with the things under the hood.

Fortunately the old way still works too if you really want to do things manually :-).
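For reference, here is a minimal sketch of doing it manually with GCC-style inline assembly. The names raw_syscall3 and my_write are hypothetical. The i386 branch uses int 0x80 as in the macro from the question; the x86-64 branch uses the syscall instruction, which takes its arguments in rdi, rsi, and rdx and clobbers rcx and r11. On any other architecture the sketch simply falls back to libc's syscall() so that it still compiles.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_write for the architecture being built */
#include <sys/types.h>
#include <unistd.h>       /* syscall(), used only by the fallback branch */

/* Hypothetical helper: issue a three-argument system call and return the
 * raw kernel result (a negative errno value on failure). */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
#if defined(__x86_64__)
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "0" (nr), "D" (a1), "S" (a2), "d" (a3)
                      : "rcx", "r11", "memory");
#elif defined(__i386__)
    __asm__ volatile ("int $0x80"
                      : "=a" (ret)
                      : "0" (nr), "b" (a1), "c" (a2), "d" (a3)
                      : "memory");
#else
    /* Not a direct call: defer to libc and convert back to -errno. */
    ret = syscall(nr, a1, a2, a3);
    if (ret == -1)
        ret = -errno;
#endif
    return ret;
}

/* Hypothetical wrapper with the same contract as write(2). */
ssize_t my_write(int fd, const void *buf, size_t count)
{
    long ret = raw_syscall3(SYS_write, fd, (long)buf, (long)count);

    /* The kernel reports errors as values in the range -4095..-1. */
    if ((unsigned long)ret >= (unsigned long)-4095L) {
        errno = (int)-ret;
        return -1;
    }
    return (ssize_t)ret;
}
```

With this in place, my_write(1, "hi\n", 3) behaves like the write function the old _syscall3 macro would have generated, except that the error check only treats the kernel's reserved error range as failure rather than every negative value.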

EDIT: one thing I've noticed in my experiments is that AT_SYSINFO doesn't seem to be given to the program on my x86_64 box (AT_SYSINFO_EHDR is, but I'm not sure how to make use of that yet). So I'm not 100% sure how the address of the system call function is determined in this situation.
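As a small, hedged illustration of what AT_SYSINFO_EHDR holds: its value is the address of the ELF header of the vDSO image mapped into the process, so the bytes at that address should begin with the ELF magic. The sketch below assumes a glibc recent enough to provide getauxval() in <sys/auxv.h>; the function name vdso_elf_check is hypothetical. It only confirms that an ELF image is there and does not resolve any symbols inside it.

```c
#include <elf.h>       /* AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <string.h>
#include <sys/auxv.h>  /* getauxval(); assumes a glibc that ships it */

/* Hypothetical helper:
 *   -1  the kernel supplied no AT_SYSINFO_EHDR entry
 *    1  the address starts with the ELF magic bytes
 *    0  the address does not look like an ELF header */
int vdso_elf_check(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    if (base == 0)
        return -1;

    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0 ? 1 : 0;
}
```

Making real use of the image would mean parsing its dynamic symbol table like any other shared object, which is beyond this sketch.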

Evan Teran