I've been reading up on Linux's "swappiness" tunable, which controls how aggressively the kernel swaps applications' memory to disk when they're not being used. If you Google the term, you get a lot of pages like this discussing the pros and cons. In a nutshell, the argument goes like this:

If your swappiness is too low, inactive applications will hog all the system memory that other programs might want to use.

If your swappiness is too high, when you wake up those inactive applications, there's going to be a big delay as their state is read back off the disk.

This argument doesn't make sense to me. If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory? This seems to give the best of both worlds: if another application needs that memory, it can immediately claim the physical RAM and start writing to it, since another copy of it is on disk and can be swapped back in when the inactive application is woken up. And when the original app wakes up, any of its pages that are still in RAM can be used as-is, without having to pull them off the disk.

Or am I missing something?

+1  A: 

Even if you page the app's memory to disk and keep it in memory, you would still have to decide when an application should be considered "inactive", and that's what swappiness controls. Paging to disk is expensive in terms of I/O and you don't want to do it too often. There is also another variable in this equation: Linux uses whatever memory remains as disk buffers/cache.
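To see how much RAM is currently going to buffers and cache, the counters in /proc/meminfo can be read directly. The following is a minimal sketch, not an official tool: `parse_meminfo` is a hypothetical helper and the sample numbers are made up for illustration. The `SwapCached` field is of particular interest here, since as far as I know it counts pages that are in RAM and also still have a copy in swap.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Made-up sample in the /proc/meminfo layout (values are illustrative only).
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
Buffers:          204800 kB
Cached:          6144000 kB
SwapCached:        10240 kB
"""

info = parse_meminfo(SAMPLE)
# RAM currently used as disk buffers/cache, in kB.
cache_kb = info["Buffers"] + info["Cached"]
```

On a real Linux box the same function can be fed the live file: `parse_meminfo(open("/proc/meminfo").read())`.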

Dprado
+3  A: 

If I have an inactive application that's using a ton of memory, why doesn't the kernel page its memory to disk AND leave another copy of that data in-memory?

Let's say we did it. We wrote the page to disk, but left it in memory. A while later another process needs memory, so we want to kick out the page from the first process.

We need to know with absolute certainty whether the first process has modified the page since it was written out to disk. If it has, we have to write it out again. The way we would track this is to take away the process's write permission to the page back when we first wrote it out to disk. If the process tries to write to the page again, there will be a page fault. The kernel can then note that the process has dirtied the page (so the page will need to be written out again) before restoring the write permission and allowing the application to continue.

Therein lies the problem. Taking away write permission from the page is actually somewhat expensive, particularly on multiprocessor machines. Every CPU must purge its cached page translations (its TLB entries) for that page, otherwise a CPU with a stale entry could keep writing without faulting.

If the process does write to the page, taking a page fault is even more expensive. I'd presume that a non-trivial number of these pages would end up taking that fault, which eats into the gains we were looking for by leaving it in memory.
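The bookkeeping described above can be modelled in a few lines. This is a toy simulation under my own assumptions, not kernel code: the `Page` class and its method names are invented for illustration, and the "fault" is just a counter.

```python
class Page:
    """Toy model of one page that stays in RAM after being copied to swap."""

    def __init__(self, data):
        self.data = data
        self.writable = True   # process may write without faulting
        self.dirty = True      # RAM contents are not yet on swap
        self.faults = 0        # simulated write-protection faults

    def write_to_swap(self):
        # The kernel copies the page to disk, then revokes write
        # permission so that any later modification gets noticed.
        self.dirty = False
        self.writable = False

    def process_write(self, data):
        if not self.writable:
            # Simulated page fault: record that the page is dirty again
            # and restore write permission before continuing.
            self.faults += 1
            self.dirty = True
            self.writable = True
        self.data = data

    def can_reclaim_without_io(self):
        # Only a clean page can be handed to another process immediately.
        return not self.dirty


page = Page(b"old")
page.write_to_swap()
clean_before_write = page.can_reclaim_without_io()

page.process_write(b"new")    # takes one fault
page.process_write(b"newer")  # permission already restored, no fault
clean_after_write = page.can_reclaim_without_io()
```

The model shows the trade-off: the page is free to reclaim only until the first write, and that first write costs a fault on top of the earlier permission change.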

So is it worth doing? I honestly don't know. I'm just trying to explain why leaving the page in memory isn't so obvious a win as it sounds.

(*) This whole thing is very similar to a mechanism called Copy-On-Write, which is used when a process fork()s. The child process is very likely going to execute just a few instructions and call exec(), so it would be silly to copy all of the parent's pages. Instead, the write permission is taken away and the child is simply allowed to run. Copy-On-Write is a win because the page fault is almost never taken: the child almost always calls exec() immediately.
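The visible effect of fork() can be demonstrated from user space. The sketch below (Unix only; `fork_and_modify` is a hypothetical name) shows only the semantics, namely that a write in the child never reaches the parent. Whether the kernel implements that by copying a page on the first write is not observable from this code; that part is taken from the explanation above.

```python
import os


def fork_and_modify():
    """Fork, modify a buffer in the child, and return both views of it."""
    data = bytearray(b"parent")
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: this write touches only the child's copy of the memory.
        os.close(read_fd)
        data[:] = b"child!"
        os.write(write_fd, bytes(data))
        os._exit(0)

    # Parent: collect what the child saw, then reap the child.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view


parent_view, child_view = fork_and_modify()
```

The parent still sees its original bytes after the child has overwritten its own copy.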

DGentry
A: 

According to this [1] that is exactly what Linux does.

I'm still trying to make sense of a lot of this, so any authoritative links would be appreciated.

BillTorpey
A: 

The first thing the VM does is clean pages and move them to the clean list.
When cleaning anonymous memory (memory with no actual file as backing store; you can see which segments in /proc/<pid>/maps are anonymous because they have no filesystem vnode storage behind them), the first thing the VM is going to do is take the "dirty" pages and "clean" them by writing the contents of the page out to swap. Then, when the VM has a shortage of completely free memory and is worried about its ability to grant new free pages, it can go through the list of 'clean' pages and, based on how recently they were used and what kind of memory they are, move those pages to the free list.
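To spot the anonymous segments mentioned above, the maps file can be filtered for entries with inode 0 and no file path. This is a rough sketch: `anonymous_regions` is a hypothetical helper, the sample text is made up, and treating only unnamed regions, `[heap]` and `[stack]` as anonymous is my simplification.

```python
def anonymous_regions(maps_text):
    """Return (label, size_in_bytes) for anonymous mappings in maps-style text."""
    regions = []
    for line in maps_text.splitlines():
        # Columns: address range, perms, offset, device, inode, [pathname]
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue
        addr_range, inode = parts[0], parts[4]
        pathname = parts[5].strip() if len(parts) > 5 else ""
        # Simplification: no backing file means inode 0 and either no
        # name at all or one of the well-known pseudo-names.
        if inode == "0" and pathname in ("", "[heap]", "[stack]"):
            start, end = (int(x, 16) for x in addr_range.split("-"))
            regions.append((pathname or "(unnamed)", end - start))
    return regions


# Made-up sample in the /proc/<pid>/maps layout.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234                    /bin/cat
0060b000-0062c000 rw-p 00000000 00:00 0                       [heap]
7f1c2a000000-7f1c2a021000 rw-p 00000000 00:00 0
7ffd5e1f0000-7ffd5e211000 rw-p 00000000 00:00 0               [stack]
"""

regions = anonymous_regions(SAMPLE_MAPS)
```

For the current process on Linux, the live data would come from `open("/proc/self/maps").read()`.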

Once the memory pages are placed on the free list, they are no longer associated with the contents they had before. If a program comes along and references the memory location the page was previously serving, the program will take a major fault, a (most likely completely different) page will be grabbed from the free list, and the data will be read into that page from disk. Once this is done, the page is actually still 'clean' since it has not been modified. If the VM chooses to reuse that page's slot on swap for a different page in RAM, then the page would be 'dirtied' again, and likewise if the app wrote to that page. And then the process begins again.
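The dirty, clean, free cycle described above can be written down as a small state machine. This is only a toy model of the description in this answer, with invented names (`State`, `AnonPage`); it tracks one page's contents as the owning process sees them and ignores LRU ordering and swap slot reuse.

```python
from enum import Enum


class State(Enum):
    DIRTY = "dirty"  # in RAM, no up-to-date copy on swap
    CLEAN = "clean"  # in RAM, identical copy on swap
    FREE = "free"    # frame given away; contents exist only on swap


class AnonPage:
    """Toy lifecycle of one anonymous page."""

    def __init__(self):
        self.state = State.DIRTY
        self.major_faults = 0

    def clean(self):
        # VM writes the page out to swap.
        if self.state is State.DIRTY:
            self.state = State.CLEAN

    def reclaim(self):
        # Under memory pressure only clean pages can move to the free list.
        if self.state is State.CLEAN:
            self.state = State.FREE

    def read(self):
        # Touching a reclaimed page costs a major fault; the data is read
        # back from swap into a fresh frame, which is still clean.
        if self.state is State.FREE:
            self.major_faults += 1
            self.state = State.CLEAN

    def write(self):
        self.read()              # fault the page in first if needed
        self.state = State.DIRTY


page = AnonPage()
page.clean()
page.reclaim()
page.read()
state_after_read = page.state
page.write()
state_after_write = page.state
```

A dirty page cannot be reclaimed directly in this model, which is why the cleaning step has to come first.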

Also, swappiness is pretty horrible for server applications in a business/transactional/online/latency-sensitive environment. When I've got 16GB RAM boxes where I'm not running a lot of browsers and GUIs, I typically want all my apps nearly pinned in memory. The bulk of my RAM tends to be 8-10GB Java heaps that I NEVER want paged to disk, ever, and the only cruft available to swap out is processes like mingetty (but even there the glibc pages in those apps are shared by other apps and actually used, so even the RSS of those useless processes is mostly shared, used pages). I normally don't see more than a few tens of MB of the 16GB actually cleaned to swap. I would advise very, very low swappiness numbers or zero swappiness for servers: the unused pages should be a small fraction of the overall RAM, and trying to reclaim that relatively tiny amount of RAM for buffer cache risks swapping application pages and taking latency hits in the running app.
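For checking what a box is currently set to, the value is exposed as a plain integer in /proc/sys/vm/swappiness (the same setting is named `vm.swappiness` for sysctl). A minimal sketch, with `read_swappiness` as a hypothetical helper name:

```python
def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the swappiness value stored in the given file as an int."""
    with open(path) as f:
        return int(f.read().strip())
```

Calling `read_swappiness()` with no argument reads the live value on Linux; the path parameter exists only so the function can be pointed at a test file.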