Inspired by this question

http://stackoverflow.com/questions/1237489/how-can-i-force-gdb-to-disassemble

and related to this one

http://stackoverflow.com/questions/1245809/what-is-int-21h

How does a system call actually happen under Linux? What happens between the moment the call is performed and the moment the actual kernel routine is invoked?

+5  A: 

Assuming we're talking about x86:

  1. The ID of the system call is deposited into the EAX register
  2. Any arguments required by the system call are deposited into the locations dictated by the system call. For example, some system calls expect their argument to reside in the EBX register. Others may expect their argument to be sitting on the top of the stack.
  3. An INT 0x80 interrupt is invoked.
  4. The Linux kernel services the system call identified by the ID in the EAX register, depositing any results in pre-determined locations.
  5. The calling code makes use of any results.
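As a rough sketch of those steps (not a definitive implementation), here is a small C helper that issues getpid by number. The helper name raw_getpid is made up for illustration. getpid takes no arguments, so step 2 is empty here. The INT 0x80 path is only compiled on 32-bit x86; on x86-64 the sketch uses the SYSCALL instruction instead, and on anything else it falls back to the libc syscall() wrapper.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: perform getpid "by hand", using its system call number. */
static long raw_getpid(void)
{
#if defined(__i386__)
    long ret = SYS_getpid;            /* step 1: call number goes into EAX */
    __asm__ volatile ("int $0x80"     /* step 3: trap into the kernel */
                      : "+a"(ret)     /* step 4: result comes back in EAX */
                      :
                      : "memory");
    return ret;
#elif defined(__x86_64__)
    long ret = SYS_getpid;            /* call number goes into RAX */
    __asm__ volatile ("syscall"       /* SYSCALL overwrites RCX and R11 */
                      : "+a"(ret)
                      :
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Other architectures: let libc pick the trap instruction. */
    return syscall(SYS_getpid);
#endif
}
```

Calling raw_getpid() should return the same value as the ordinary getpid() from libc, which is step 5: the caller just reads the result.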

I may be a bit rusty at this, it's been a few years...

Adam Paynter
If I remember correctly, since the kernel has its own stack, no userspace program can put something on it, so all arguments have to be passed through the registers.
Benno
Really? I vaguely remember my professor mentioning Linux had some macros for mapping userspace addresses to kernelspace... I could be wrong.
Adam Paynter
INT 0x80 is only used where the SYSCALL/SYSENTER instructions are not available, IIRC.
Matthew Iselin
+3  A: 

Basically, it's very simple: somewhere in memory lies a table in which each syscall number and the address of the corresponding handler are stored (see http://lxr.linux.no/linux+v2.6.30/arch/x86/kernel/syscall_table_32.S for the x86 version).

The INT 0x80 interrupt handler then just takes the arguments out of the registers, puts them on the (kernel) stack, and calls the appropriate syscall handler.
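To make the table idea concrete, here is a toy user-space model in C. It is only an illustration under my own assumptions, not kernel code: the names toy_regs, toy_table, toy_dispatch and the two "handlers" are all invented. The dispatcher reads the call number from the saved eax, checks it against the table size, and calls the handler with the register values as arguments.

```c
#include <errno.h>    /* ENOSYS */
#include <stddef.h>   /* size_t */

/* Hypothetical stand-in for the user registers saved on kernel entry. */
struct toy_regs { long eax, ebx, ecx, edx; };

typedef long (*toy_handler)(long, long, long);

/* Two made-up "system calls" so the table has something to point at. */
static long toy_sys_add(long a, long b, long c) { (void)c; return a + b; }
static long toy_sys_neg(long a, long b, long c) { (void)b; (void)c; return -a; }

/* The table: index = call number, value = handler address. */
static const toy_handler toy_table[] = { toy_sys_add, toy_sys_neg };
#define TOY_NR_SYSCALLS (sizeof toy_table / sizeof toy_table[0])

/* Look up the handler by number, pass the register arguments,
 * and store the result back in eax for the caller to pick up. */
static long toy_dispatch(struct toy_regs *regs)
{
    if (regs->eax < 0 || (size_t)regs->eax >= TOY_NR_SYSCALLS)
        regs->eax = -ENOSYS;   /* unknown call number */
    else
        regs->eax = toy_table[regs->eax](regs->ebx, regs->ecx, regs->edx);
    return regs->eax;
}
```

For example, a toy_regs with eax = 0, ebx = 2, ecx = 3 dispatches to toy_sys_add and leaves 5 in eax, while an out-of-range number yields -ENOSYS.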

Benno
+3  A: 

This is already answered at
How is the system call in Linux implemented?
It probably did not show up as a match for this question because of the differing usage of the term "syscall".

nik
+3  A: 

The given answers are correct but I would like to add that there are more mechanisms to enter kernel mode. Every recent kernel maps the "vsyscall" page in every process' address space. It contains little more than the most efficient syscall trap method.

For example, on a regular 32-bit system it could contain:

 
0xffffe000: int $0x80
0xffffe002: ret

But on my 64-bit system I have access to the far more efficient method using the syscall/sysenter instructions:


0xffffe000: push   %ecx
0xffffe001: push   %edx
0xffffe002: push   %ebp
0xffffe003: mov    %esp,%ebp
0xffffe005: sysenter
0xffffe007: nop
0xffffe008: nop
0xffffe009: nop
0xffffe00a: nop
0xffffe00b: nop
0xffffe00c: nop
0xffffe00d: nop
0xffffe00e: jmp    0xffffe003
0xffffe010: pop    %ebp
0xffffe011: pop    %edx
0xffffe012: pop    %ecx
0xffffe013: ret
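If you want to find this mapping in your own process, here is a minimal sketch in C. It assumes a current kernel and libc, where the mapping is generally exposed as the vDSO at a per-process address (rather than a fixed one like 0xffffe000) and its base is published in the auxiliary vector under AT_SYSINFO_EHDR. The helper names kernel_page_base and looks_like_elf are my own.

```c
#include <string.h>     /* memcmp */
#include <sys/auxv.h>   /* getauxval, AT_SYSINFO_EHDR */

/* Base address of the kernel-provided mapping, or 0 if the kernel
 * did not advertise one in the auxiliary vector. */
static unsigned long kernel_page_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the mapping starts with the ELF magic bytes, else 0.
 * A base of 0 is treated as "no mapping" and is never dereferenced. */
static int looks_like_elf(unsigned long base)
{
    return base != 0 && memcmp((const void *)base, "\177ELF", 4) == 0;
}
```

When the base is nonzero, the mapping is a small ELF image, so it can be dumped from a debugger and disassembled to get a listing like the ones above.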

This vsyscall page also maps some system calls that can be done without a context switch. I know for certain that gettimeofday, time and getcpu are mapped there, but I imagine getpid could fit in there just as well.
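A quick way to see the two paths side by side is sketched below in C. It assumes your libc routes gettimeofday() through the kernel-mapped page when it can (that depends on the libc, kernel and clock source), whereas syscall(SYS_gettimeofday, ...) always traps into the kernel. It also assumes the default build where libc's struct timeval matches the kernel's layout. Both helper names are made up.

```c
#define _GNU_SOURCE
#include <stddef.h>        /* NULL */
#include <sys/syscall.h>   /* SYS_gettimeofday */
#include <sys/time.h>      /* gettimeofday, struct timeval */
#include <unistd.h>        /* syscall */

/* Library path: may be served from the kernel-mapped page without a trap. */
static long seconds_via_libc(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    return (long)tv.tv_sec;
}

/* Forced path: always performs a real system call by number. */
static long seconds_via_trap(void)
{
    struct timeval tv;
    if (syscall(SYS_gettimeofday, &tv, NULL) != 0)
        return -1;
    return (long)tv.tv_sec;
}
```

Both functions should report the same time to within a second or so; timing each one in a tight loop is a simple way to check whether the library path really avoids the trap on your machine.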

kmm