Consider a Linux driver that uses get_user_pages (or get_page) to map pages from the calling process. The physical addresses of the pages are then passed to a hardware device. Both the process and the device may read and write to the pages until the parties decide to end the communication. In particular, the communication may continue using the pages after the system call that calls get_user_pages returns. In effect, the system call sets up a shared memory zone between the process and the hardware device.

I'm concerned about what happens if the process calls fork (it could be from another thread, and could happen either while the syscall that calls get_user_pages is in progress or later). In particular, if the parent writes to the shared memory area after the fork, what do I know about the underlying physical address (presumably changed due to copy-on-write)? I want to understand:

  1. what the kernel needs to do to defend against a potentially misbehaving process (I don't want to create a security hole!);
  2. what restrictions the process needs to obey so that the functionality of our driver works correctly (i.e. the physical memory remains mapped at the same address in the parent process).

    • Ideally, I would like the common case where the child process doesn't use our driver at all (it probably calls exec almost immediately) to work.
    • Ideally, the parent process should not have to take any special steps when allocating the memory, as we have existing code that passes a stack-allocated buffer to the driver.
    • I'm aware of madvise with MADV_DONTFORK, and it would be ok to have the memory disappear from the child process's space, but it's not applicable to a stack-allocated buffer.
    • “Don't use fork while you have a connection active with our driver” would be annoying, but acceptable as a last resort if point 1 is satisfied.

I'm willing to be pointed to documentation or source code. I've looked in particular at Linux Device Drivers, but didn't find this issue addressed. RTFS applied to even just the relevant part of the kernel source is a bit overwhelming.

The kernel version is not completely fixed, but it is a recent one (let's say ≥2.6.26). We're only targeting Arm platforms (single-processor so far, but multicore is just around the corner), if it matters.

+1  A: 

A fork() will not interfere with get_user_pages(): get_user_pages() will give you a struct page.

You would need to kmap() it before being able to access it, and this mapping is done in kernel space, not userspace.

EDIT: get_user_pages() touches the page table, but you should not be worried about this (it just makes sure that the pages are mapped in userspace); it returns -EFAULT if it had any problem doing so.

If you fork(), until copy-on-write is performed, the child will be able to see that page. Once copy-on-write is done (because the child, the driver, or the parent wrote to the page through the userspace mapping, not through the kernel kmap() the driver has), that page will no longer be shared. If you still hold a kmap() on the page (in the driver code), you will not be able to know whether you are holding the parent's page or the child's.

1) It's not a security hole, because once you execve(), all of that is gone.

2) When you fork(), you want both processes to be identical (it's a fork!). I would think that your design should allow both the parent and the child to access the driver. execve() will flush everything.

What about adding some functionality in userspace like:

 fd = open("/dev/your_thing", O_RDWR);
 mapping = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

When mmap() is called on your device, you install a memory mapping, with special flags: http://os1a.cs.columbia.edu/lxr/source/include/linux/mm.h#071

You have some interesting things like:

#define VM_SHARED       0x00000008
#define VM_LOCKED       0x00002000
#define VM_DONTCOPY     0x00020000      /* Do not copy this vma on fork */

VM_SHARED will disable copy-on-write. VM_LOCKED will disable swapping on that page. VM_DONTCOPY will tell the kernel not to copy the vma region on fork, although I don't think it's a good idea.

Nicolas Viennot
Thanks for this interesting answer. I'm taking up the maintenance of existing code and just starting with Linux kernel programming, and hadn't caught on to `kmap` being relevant. I don't really care if the child can't access our driver; if the process forks, it would be for an unrelated purpose like `popen`. I don't control how the memory is allocated in userspace (we even have some code that passes a buffer on the stack to our driver). Is there a way for the *driver* to say “I want this physical page to remain mapped in the parent” (no matter how the parent obtained the page)?
Gilles
You can use the syscall `mlock()`, which will basically add a VM_LOCKED flag on the targeted vma.
Nicolas Viennot
A: 

The short answer is to use madvise(addr, len, MADV_DONTFORK) on any userspace buffers you give to your driver. This tells the kernel that the mapping should not be copied from parent to child and so there is no CoW.

The drawback is that the child inherits no mapping at that address, so if you want the child to then start using the driver it will need to remap that memory. But that is fairly easy to do in userspace.
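To make the effect concrete, here is a minimal userspace sketch, assuming a Linux system with MADV_DONTFORK (2.6.16 or later). The function name `dontfork_hides_buffer_from_child` is hypothetical and exists only for this illustration. It marks an anonymous page DONTFORK, forks, and has the child probe the address through write() on a pipe, so that a missing mapping shows up as EFAULT rather than as a crash.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not part of any driver API).
 * Returns 1 if the child has no mapping at the buffer address after fork(),
 * 0 if the child can still read the buffer, -1 on a setup error.
 */
int dontfork_hides_buffer_from_child(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int result = -1;
    int pipefd[2];

    if (pipe(pipefd) != 0)
        return -1;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        goto out_pipe;
    buf[0] = 'x';                         /* fault the page in */

    if (madvise(buf, len, MADV_DONTFORK) != 0)
        goto out_map;

    pid_t pid = fork();
    if (pid < 0)
        goto out_map;
    if (pid == 0) {
        /* Child: write() has to copy one byte from buf. If the address
         * is unmapped, the kernel reports EFAULT instead of faulting. */
        ssize_t n = write(pipefd[1], buf, 1);
        _exit((n == -1 && errno == EFAULT) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = (WEXITSTATUS(status) == 0);

    assert(buf[0] == 'x');                /* parent's mapping is untouched */

out_map:
    munmap(buf, len);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

A return value of 1 means the child inherited nothing at that address, which is the drawback described above: a child that wants to use the driver has to set up its own buffer.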

Update: A buffer on the stack is problematic; I'm not sure you can make it safe in general.

You can't mark it DONTFORK, because your child might be running on that stack page when it forks, or (worse in a way) it might do a function return later and hit the unmapped stack page. (I even tested this: you can happily mark your stack DONTFORK, and bad things happen when you fork.)

The other way to avoid a CoW is to create a shared mapping, but you can't map your stack shared for obvious reasons.

That means you risk a CoW if you fork. Even if the child "just" execs it might still touch the stack page and cause a CoW, leading to the parent getting a different page, which is bad.
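The difference between a shared and a private mapping across fork() can be observed from userspace with a small sketch like the one below. The helper name `child_write_reaches_parent` is made up for this example, and it uses anonymous mmap regions rather than the stack. It only shows whether the two processes still see the same contents; it does not inspect physical addresses.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper. map_type is MAP_SHARED or MAP_PRIVATE.
 * Returns 1 if a write made by the child after fork() is visible to the
 * parent, 0 if it is not, -1 on error.
 */
int child_write_reaches_parent(int map_type)
{
    assert(map_type == MAP_SHARED || map_type == MAP_PRIVATE);

    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     map_type | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 'P';                 /* parent touches the page before fork */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 'C';             /* private mapping: this triggers a CoW */
        _exit(0);
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = (buf[0] == 'C');

    munmap(buf, len);
    return result;
}
```

With MAP_SHARED the parent sees the child's write because both still use the same page. With MAP_PRIVATE it does not, because the write split the page between the two processes, and that split is exactly what a device holding the old physical address cannot follow.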

The one minor point in your favor is that code using an on-stack buffer only needs to worry about the code it calls forking, i.e. you can't use an on-stack buffer after the function has returned. So you only need to audit your callees, and if they never fork you're safe. That may still be infeasible, though, and it is fragile if the code ever changes.

I think you really want all memory that is given to your driver to come from a custom allocator in userspace. It shouldn't be that intrusive. The allocator can either mmap your device directly, as the other answer suggested, or just use anonymous mmap, madvise(DONTFORK), and probably mlock() to avoid swap out.
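A rough sketch of the anonymous-mmap variant of such an allocator could look like this. The names `drv_buf_alloc` and `drv_buf_free` are invented for the example, and treating an mlock() failure as non-fatal is an assumption on my part (RLIMIT_MEMLOCK is often small for unprivileged processes); a real driver library may prefer to fail hard.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a length up to a whole number of pages. */
static size_t round_up_to_page(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return (len + (size_t)page - 1) / (size_t)page * (size_t)page;
}

/*
 * Hypothetical allocator for buffers handed to the driver.
 * The buffer is page-aligned, private, excluded from fork() (so no CoW
 * can move it), and locked in RAM when the memlock limit allows it.
 * If locked is non-NULL, *locked is set to 1 when mlock() succeeded.
 */
void *drv_buf_alloc(size_t len, int *locked)
{
    if (len == 0)
        return NULL;

    size_t size = round_up_to_page(len);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (madvise(p, size, MADV_DONTFORK) != 0) {
        munmap(p, size);
        return NULL;
    }

    int ok = (mlock(p, size) == 0);   /* assumption: best effort only */
    if (locked)
        *locked = ok;
    return p;
}

/* Release a buffer from drv_buf_alloc(); munmap() also drops the lock. */
int drv_buf_free(void *p, size_t len)
{
    return munmap(p, round_up_to_page(len));
}
```

Callers would replace their on-stack buffer with something like `char *buf = drv_buf_alloc(n, NULL);` and release it with `drv_buf_free(buf, n)` once the driver connection is closed.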

mpe
@mpe: Actually I was already aware of `MADV_DONTFORK`, but it would be a restriction compared to what we do now (we have code that uses a buffer on the stack). I should have mentioned this in my question. If you have a long answer, I'd be interested to read it.
Gilles