Hi there,

I have a problem that has been described in several threads: memory mapping a file and the resulting growth in memory consumption under Linux.

When I open a 1 GB file under Linux or Mac OS X and map it into memory using

me.data_begin = mmap(NULL, capacity(me), prot, MAP_SHARED, me.file.handle, 0);

and then read the mapped memory sequentially, my program uses more and more physical memory, even though I call posix_madvise (I even called it several times during the read):

posix_madvise(me.data_begin, capacity(me), POSIX_MADV_SEQUENTIAL);

without success. :-(
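For reference, a minimal self-contained version of the map-then-advise pattern described above might look like the sketch below. It assumes a read-only mapping (PROT_READ, MAP_SHARED) of the whole file and uses the POSIX constant POSIX_MADV_SEQUENTIAL. The function name sum_mapped_file is hypothetical and not part of the original code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the whole file read-only, advise sequential
 * access, and touch every byte once. Returns 0 on success, -1 on error. */
int sum_mapped_file(int fd, uint64_t *sum_out)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    size_t len = (size_t)st.st_size;
    *sum_out = 0;
    if (len == 0)
        return 0; /* mmap() with a length of 0 fails with EINVAL */

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Only a hint: the kernel is free to ignore it. posix_madvise()
     * returns an error number directly instead of setting errno. */
    int rc = posix_madvise(p, len, POSIX_MADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_madvise: error %d\n", rc);

    /* Sequential pass over the mapping; each new page is faulted in. */
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap(p, len);
    *sum_out = sum;
    return 0;
}
```

Note that posix_madvise() reports failure through its return value, so checking errno after the call would not tell you anything.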

I tried:

  • different advice values POSIX_MADV_RANDOM, POSIX_MADV_DONTNEED, POSIX_MADV_NORMAL, without success
  • posix_fadvise(me.file.handle, 0, capacity(me), POSIX_FADV_DONTNEED) before and after calling mmap -> no success
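A sketch of the posix_fadvise() variant from the list, with its return value checked, could look like this. As far as I know, on Linux POSIX_FADV_DONTNEED only drops page-cache pages that are clean and not currently mapped into any process, which would explain why calling it around mmap() made no visible difference. The helper name drop_file_cache and the fdatasync() step are assumptions added for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write back dirty data for the file, then ask the
 * kernel to drop its cached pages for [offset, offset + len).
 * len == 0 means "up to the end of the file".
 * Returns 0 on success or an error number. */
int drop_file_cache(int fd, off_t offset, off_t len)
{
    /* Dirty pages cannot simply be dropped, so flush them first. */
    if (fdatasync(fd) != 0)
        return errno;

    /* posix_fadvise() returns the error number instead of setting errno. */
    int rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
    return rc;
}
```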

It does work under Mac OS X when I combine

posix_madvise(..., POSIX_MADV_SEQUENTIAL)

and

msync(me.data_begin, capacity(me), MS_INVALIDATE).

The resident memory then stays below 16 MB (I call msync periodically, every 16 million steps).
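The periodic msync(MS_INVALIDATE) workaround described above can be sketched roughly as follows: the mapping is processed in fixed-size windows, and after each window the part already read is passed to msync(). The window size is a parameter here (for example 16 MiB as a stand-in for "every 16 million steps", which is an assumption), and the helper name read_with_invalidate is hypothetical. Per the question, this helped on Mac OS X but had no comparable effect on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: sum a mapped region in fixed-size windows and,
 * after each window, call msync(MS_INVALIDATE) on the part just read.
 * `base` must be page-aligned (as returned by mmap()); `window` is
 * rounded up to a whole number of pages. Returns 0, or -1 on error. */
int read_with_invalidate(unsigned char *base, size_t len, size_t window,
                         uint64_t *sum_out)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (window == 0)
        window = page;
    window = (window + page - 1) / page * page;

    uint64_t sum = 0;
    for (size_t off = 0; off < len; off += window) {
        size_t n = (len - off < window) ? len - off : window;
        for (size_t i = 0; i < n; i++)
            sum += base[off + i];
        /* off is a multiple of the page size, so base + off is aligned. */
        if (msync(base + off, n, MS_INVALIDATE) != 0) {
            perror("msync");
            return -1;
        }
    }
    *sum_out = sum;
    return 0;
}
```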

But under Linux nothing works. Does anyone have an idea, or a success story, for this problem under Linux?

Cheers, David

A:

Linux memory management is different from other systems. The key principle is that memory that is not being used is memory being wasted. In many ways, Linux tries to maximize memory usage, resulting (most of the time) in better performance.

It is not that "nothing works" in Linux; its behavior is just a little different from what you expect.

When memory pages are pulled in from the mmapped file, the operating system has to decide which physical memory pages it will release (or swap out) to make room for them. It looks for pages that are easier to swap out (ones that don't require an immediate disk write) and that are less likely to be used again.

The madvise() call (posix_madvise() in POSIX) serves to tell the system how your application will use the pages. But, as the name says, it is advice, meant to keep the operating system better informed when it makes paging and swapping decisions. It is neither a policy nor an order.
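To make the "advice, not an order" point concrete, here is a hedged sketch that applies the same sequential hint through both interfaces: the POSIX posix_madvise() and the BSD/Linux madvise(). The Linux man page describes MADV_SEQUENTIAL as allowing pages to be read ahead aggressively and freed soon after they are accessed, which is permission rather than obligation. As far as I know, glibc's posix_madvise() even returns 0 for POSIX_MADV_DONTNEED without passing it to the kernel, because Linux's MADV_DONTNEED semantics do not match POSIX. The helper name advise_sequential is hypothetical.

```c
#define _DEFAULT_SOURCE /* exposes madvise() and MAP_ANONYMOUS in glibc */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: give the same "sequential access" hint through
 * both interfaces. A return value of 0 only means the hint was accepted,
 * not that the kernel will evict pages in any particular way. */
int advise_sequential(void *addr, size_t len)
{
    /* POSIX interface: returns an error number, does not set errno. */
    int rc = posix_madvise(addr, len, POSIX_MADV_SEQUENTIAL);
    if (rc != 0)
        return rc;

    /* BSD/Linux interface: returns -1 and sets errno on failure. */
    if (madvise(addr, len, MADV_SEQUENTIAL) != 0) {
        perror("madvise");
        return -1;
    }
    return 0;
}
```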

To demonstrate the effects of madvise() on Linux, I modified one of the exercises I give to my students. See the complete source code here. My system is 64-bit and has 2 GB of RAM, of which about 50% is currently in use. The program mmaps a 2 GB file, reads it sequentially and discards everything, reporting RSS usage for every 200 MB read (left column: MB read; right column: RSS). The results without madvise():

<juliano@home> ~% ./madvtest file.dat n
     0 :     3 MB
   200 :   202 MB
   400 :   402 MB
   600 :   602 MB
   800 :   802 MB
  1000 :  1002 MB
  1200 :  1066 MB
  1400 :  1068 MB
  1600 :  1078 MB
  1800 :  1113 MB
  2000 :  1113 MB

Linux kept pushing other things out of memory until around 1 GB had been read. After that, it started putting pressure on the process itself (since the other 50% of memory was in active use by other processes), and the RSS stabilized until the end of the file.

Now, with madvise():

<juliano@home> ~% ./madvtest file.dat y
     0 :     3 MB
   200 :   202 MB
   400 :   402 MB
   600 :   494 MB
   800 :   501 MB
  1000 :   518 MB
  1200 :   530 MB
  1400 :   530 MB
  1600 :   530 MB
  1800 :   595 MB
  2000 :   788 MB

Note that Linux kept allocating pages to the process only until its RSS reached around 500 MB, much sooner than without madvise(). This is because, after that point, the pages already in memory seemed much more valuable than the pages this process had marked for sequential access. There is a threshold in the VMM that defines when to start dropping old pages from the process.

You may ask why Linux kept allocating pages up to around 500 MB and didn't stop much sooner, since they were marked for sequential access. The reason is that either the system had enough free memory pages anyway, or the other resident pages were too old to keep around. Between keeping ancient pages in memory that no longer seem useful and bringing in more pages to serve a program that is running now, Linux chooses the second option.

Even though the pages were marked for sequential access, that was just advice. The application may still go back and read those pages again, or another application on the system may. The madvise() call only describes what this application is doing; Linux takes the bigger picture into consideration.
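Since the original exercise is only linked, not reproduced, here is a hypothetical re-creation of the experiment (not the original program) for anyone who wants to run it: it maps a file, optionally applies POSIX_MADV_SEQUENTIAL, reads it once, and prints "MB read : RSS" at a fixed interval. Reading RSS from the second field of /proc/self/statm (resident pages) is Linux-specific and is an assumption about how such a program might measure it; the names rss_bytes and madv_experiment are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Linux-only: resident set size in bytes from /proc/self/statm (the second
 * field is the number of resident pages), or -1 on error. */
static long long rss_bytes(void)
{
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return -1;
    long long size_pages, resident_pages;
    int n = fscanf(f, "%lld %lld", &size_pages, &resident_pages);
    fclose(f);
    return n == 2 ? resident_pages * (long long)sysconf(_SC_PAGESIZE) : -1;
}

/* Hypothetical experiment: map the file, optionally advise sequential
 * access, read it once and print "MB read : RSS MB" every `step` bytes.
 * Returns 0, or -1 on error. */
int madv_experiment(int fd, int use_advice, size_t step)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0 || step == 0)
        return -1;
    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    if (use_advice)
        (void)posix_madvise(p, len, POSIX_MADV_SEQUENTIAL); /* only a hint */

    volatile unsigned char sink = 0; /* keeps the reads from being optimized away */
    for (size_t i = 0; i < len; i++) {
        if (i % step == 0)
            printf("%6zu : %5lld MB\n", i >> 20, rss_bytes() >> 20);
        sink ^= p[i];
    }
    printf("%6zu : %5lld MB\n", len >> 20, rss_bytes() >> 20);
    (void)sink;
    munmap(p, len);
    return 0;
}
```

Calling madv_experiment(fd, 0, 200u << 20) and then madv_experiment(fd, 1, 200u << 20) on a large file mirrors the two runs above, although the numbers will depend on the machine's RAM and what else is running.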

Juliano
Thank you Juliano, that 50% behaviour is interesting. I just wonder why there is no way to force Linux to free pages that I will never read again. Instead, it sacrifices the file system's buffers and caches. On Mac OS X, sacrificing these buffers stalls the system until it is completely unusable, but fortunately we can prevent that with *msync(..., MS_INVALIDATE)*. On Linux, it seems to be the behavior you observed with madvise() that prevents the system from stalling.
Dave
@Dave: consider that there is no point in freeing those pages prematurely. Linux is not sacrificing caches and buffers; it is actually using those pages as cache. As you read more data from disk, Linux has to bring it into memory anyway. It effectively caches what was read from disk, but instead of accounting it as "cache", it accounts it as part of the RSS of the process that mapped the file. When Linux needs memory for cache again, it will free those pages mapped into that application. You don't need to be concerned about that!
Juliano
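One way to check the accounting described in the last comment is to look at the per-category RSS fields in /proc/self/status. Newer Linux kernels split RSS into RssAnon, RssFile and RssShmem, and pages of a mapped file should show up under RssFile rather than RssAnon. The helper below is a hypothetical sketch that reads one such field in kB; it returns -1 if the field is absent (for example on older kernels).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (Linux-only): return the value in kB of a field such
 * as "VmRSS:", "RssFile:" or "RssAnon:" from /proc/self/status, or -1 if
 * the file cannot be read or the field is not present. */
long status_field_kb(const char *field)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long kb = -1;
    size_t n = strlen(field);
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, field, n) == 0) {
            /* %ld skips the whitespace between the name and the value. */
            if (sscanf(line + n, "%ld", &kb) != 1)
                kb = -1;
            break;
        }
    }
    fclose(f);
    return kb;
}
```

Printing status_field_kb("RssFile:") and status_field_kb("RssAnon:") before and after reading a large mapped file should show the growth landing in RssFile, that is, in file-backed pages the kernel can reclaim.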