I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

A: 

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once; that will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (in practice the limit is often lower, around 2GB) you will need a 64-bit architecture, so if you're in a 32-bit environment you probably don't want to use it.
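For illustration, a minimal sketch of mapping the whole file at once (POSIX; error handling mostly omitted, and the file name is just a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("file.bin", O_RDONLY);
struct stat sb;
fstat(fd, &sb);                       // sb.st_size is the file length

// one mapping for the entire file; needs a 64-bit address space
// once files approach the 2-4GB range
void *base = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (base == MAP_FAILED) { /* handle error */ }

// ... walk the records through 'base' ...

munmap(base, sb.st_size);
close(fd);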

Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it out in one pass and then be done with it, that could potentially be much faster.

Leon Timmermans
+13  A: 

mmap is way faster. You might write a simple benchmark to prove it to yourself:

#include <fstream>

char data[0x1000];
std::ifstream in("file.bin", std::ios::binary);

while (in)
{
  in.read(data, 0x1000);
  // in.gcount() bytes are valid here, including on the final partial read
  // do something with data
}

versus:

#include <fcntl.h>
#include <sys/mman.h>

const int file_size = something;   // e.g., obtained via fstat()
const int page_size = 0x1000;
int off = 0;
void *data;

int fd = open("filename.bin", O_RDONLY);

while (off < file_size)
{
  // the flags argument needs MAP_PRIVATE or MAP_SHARED; 0 is not valid
  data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
  // do stuff with data
  munmap(data, page_size);
  off += page_size;
}

Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size), but it really shouldn't be much more complicated than this.
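One way to handle that end-of-file detail, sketched under the assumption that file_size was obtained up front (e.g., with fstat()), is to map only the bytes that remain:

size_t len = std::min((size_t)page_size, (size_t)(file_size - off));
data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
// ... process 'len' bytes ...
munmap(data, len);
off += len;

Since off advances by whole pages until the final chunk, it stays page-aligned, which mmap requires of its offset argument.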

If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).

A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(

Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.

Ben Collins
RE: mmaped file cache on Windows. Exactly: when file buffering is enabled, the kernel memory maps the file you're reading internally, reads into that buffer and copies it back into your process. It's as if you memory mapped it yourself except with an extra copy step.
Chris Smith
I'm loath to disagree with an accepted answer, but I believe this answer is wrong. I followed your suggestion and tried your code on a 64-bit Linux machine, and mmap() was no faster than the STL implementation. Also, theoretically I would not expect mmap() to be any faster (or slower).
Tim Cooper
@Tim Cooper: you may find this thread (http://markmail.org/message/zbxnldcz6dlgo5of#query:mmap%20vs%20blocks+page:1+mid:zbxnldcz6dlgo5of+state:results) of interest. Note two things: mmap isn't properly optimized in Linux, and one should also use madvise in the test to get the best results.
Ben Collins
FWIW, I also played around with reading chunks off disk (e.g., 4096 bytes at a time) and didn't see any improvement over, essentially, reading a record at a time. I saw another relevant post somewhere on mmap() vs. read(), in mmap()'s favor. I'll try and dig it up.
jbl
Dear Ben: I've read that link. If 'mmap()' is not faster on Linux, and MapViewOfFile() is not faster on Windows, then can you make the claim that "mmap is way faster"? Also, for theoretical reasons I believe mmap() is no faster for sequential reads - do you have any explanation to the contrary?
Tim Cooper
Here's a link to someone who's done the copy file test with a variety of methods. http://lkml.org/lkml/2008/1/14/491. I think I probably need to write the mmap version of my code to fully resolve this question. I was hoping not to have to!
jbl
There seems to be evidence going both ways, so I'm going to unaccept this until there's some more definitive evidence from me or others.
jbl
More details on my tests to disprove Ben's assertion: I ran 3 programs, Ben's 2 programs plus a version using C-style open/read. I got these results: open/read: 7m42s, ifstream: 8m27s, mmap: 8m08s, on a 2-year-old Linux machine with 2.5GB of RAM and a 4.9GB data file, calculating a simple XOR checksum.
Tim Cooper
@Tim Cooper: My tests are probably not good benchmarks, because mmapping a single page and reading one page's worth of data at a time using istream.read() likely amount to the same operations under the hood. The reason for my assertion is because of experience in trying to do it both ways. For example, I've written code in which I tried to open very large files for reading, and found that mmap was /way/ faster just as I claim. A better benchmark would test different page sizes, different file sizes, different loop constructs, etc. In the end, I think most would find mmap to be faster.
Ben Collins
A: 

I don't think I can get around scanning the input several times. It contains a sparse matrix that needs to be multiplied with vectors over and over and won't fit in physical memory. That said, at least the file access pattern is sequential (which is why the buffering idea might work).

Maybe the question should be, what are the downsides to mmap'ing huge files? I guess it depends on the underlying VM subsystem, but I haven't been able to find enough information to help me make an informed choice.

jbl
A: 

Ben - the sliding window mmap sounds interesting. Can you say a little more about it?

The main thing I'm worried about with mmap is that I'm betting that records will often be broken by page boundaries, so in order to process a record, I'd have to essentially go back and forth between two pages dynamically (or else just increase the size of the mmap'd region) as I go along.

Breaking up the input is a good idea. I like that.

jbl
+1  A: 

Perhaps you should pre-process the files, so that each record is in a separate file (or at least so that each file is a mmap-able size).

Also, could you do all of the processing steps for each record before moving on to the next one? Maybe that would avoid some of the I/O overhead?

Douglas Leeder
A: 

I agree with Douglas: you should probably preprocess it. It is not very likely that this dataset can't be cut up into pieces. You said the content is sparse; how are you storing it? Compacted or not?

Leon Timmermans
A: 

Cutting the content up into mmap'able pieces is definitely a good approach and easily workable. I'm kind of annoyed I hadn't thought of it now.

The records are matrix rows in compressed sparse row format. Each record has a row number, a field for the number of non-zero entries in the row (nnz), nnz column indices, and nnz values.

| rownum | nnz | col1, col2, ..., colnnz | val1, val2, ..., valnnz |
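For concreteness, a sketch of what parsing such a record straight out of a mapped buffer might look like (the field widths here are assumptions, not necessarily the actual on-disk types):

#include <cstdint>
#include <cstring>

// Hypothetical on-disk widths: 32-bit rownum and nnz, 32-bit column
// indices, 64-bit double values.
struct CsrRecord {
    uint32_t rownum;
    uint32_t nnz;
    const uint32_t *cols;   // point into the mapped buffer, no copying
    const double   *vals;
};

// Parse one record at 'p' and return the address of the next record.
// Assumes the arrays happen to be suitably aligned in the file; if
// not, memcpy them out the way the header fields are handled.
const char *parse_record(const char *p, CsrRecord &rec)
{
    std::memcpy(&rec.rownum, p, 4);  p += 4;
    std::memcpy(&rec.nnz,    p, 4);  p += 4;
    rec.cols = reinterpret_cast<const uint32_t *>(p);  p += 4ul * rec.nnz;
    rec.vals = reinterpret_cast<const double *>(p);    p += 8ul * rec.nnz;
    return p;
}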
jbl
+2  A: 

@jbl:

the sliding window mmap sounds interesting. Can you say a little more about it?

Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).

Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to files of at most 2^(wordsize) bytes. On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o.

Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by deriving a mapped_filebuf from std::filebuf and, similarly, a mapped_fstream from std::fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
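As a rough reconstruction (this is not the lost code, and the window arithmetic is simplified, e.g. a hard-coded 4K page size), a sliding-window Source could look something like this:

#include <boost/iostreams/concepts.hpp>
#include <boost/iostreams/stream.hpp>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <algorithm>
#include <cstring>

// Sliding-window Source: keeps one fixed-size mmap'd window over the
// file and remaps it forward when a read runs past the window's end.
class mapped_file_source {
public:
    typedef char char_type;
    typedef boost::iostreams::source_tag category;

    explicit mapped_file_source(const char* path,
                                std::size_t window = 64u << 20)
        : win_(window), off_(0), len_(0), pos_(0), base_(NULL)
    {
        fd_   = open(path, O_RDONLY);
        size_ = (std::size_t)lseek(fd_, 0, SEEK_END);
        slide(0);
    }

    std::streamsize read(char* s, std::streamsize n) {
        if (pos_ >= size_)
            return -1;                       // EOF, per the Source concept
        if (pos_ + (std::size_t)n > off_ + len_)
            slide(pos_);                     // window exhausted: remap
        std::streamsize k = std::min<std::streamsize>(
            n, (std::streamsize)(off_ + len_ - pos_));
        std::memcpy(s, base_ + (pos_ - off_), (std::size_t)k);
        pos_ += (std::size_t)k;
        return k;                            // may be a short read; that's allowed
    }

private:
    void slide(std::size_t pos) {
        if (base_) munmap(base_, len_);
        off_  = pos & ~(std::size_t)(4096 - 1);   // round down to a page
        len_  = std::min(win_, size_ - off_);
        base_ = (char*)mmap(NULL, len_, PROT_READ, MAP_PRIVATE,
                            fd_, (off_t)off_);
    }

    int fd_;
    std::size_t win_, size_, off_, len_, pos_;
    char* base_;
};

You would then wrap it as boost::iostreams::stream<mapped_file_source> to get a std::istream-compatible interface.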

Ben Collins
A: 

This sounds like a good use case for multi-threading... I'd think you could pretty easily set up one thread to read data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
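A minimal sketch of that idea (my illustration, using a single-slot hand-off between a reader thread and the processing loop; a real implementation would likely use a deeper queue):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::mutex m;
    std::condition_variable cv;
    std::vector<char> slot;                       // single-slot mailbox
    bool full = false, done = false;

    std::thread reader([&] {
        std::ifstream in("file.bin", std::ios::binary);
        std::vector<char> buf(1 << 20);           // 1MB blocks
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
            buf.resize(in.gcount());              // shrink the final block
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !full; });   // wait for an empty slot
            slot.swap(buf);
            full = true;
            cv.notify_one();
            buf.resize(1 << 20);
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_one();
    });

    for (;;) {
        std::vector<char> block;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return full || done; });
            if (!full && done) break;             // reader finished, slot empty
            block.swap(slot);
            full = false;
            cv.notify_one();
        }
        // process_records(block);                // do the real work here
    }
    reader.join();
}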

Pat Notz
Yep. I have been thinking about that and will probably try it out in a later release. The only reservation I have is that the processing is far shorter than the I/O latency, so there may not be much benefit.
jbl
+1  A: 

I'm sorry Ben Collins lost his sliding-window mmap source code. That'd be nice to have in Boost.

Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory to disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would, because doing so makes just about everything faster: databases, boot times, program load times, et cetera.

The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter, so long as the largest single record fits into memory. The important thing is managing the book-keeping.

If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page boundary. The region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it and move on to the next.
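That book-keeping might look something like the following sketch (POSIX; rec_off, rec_len, and fd are assumed inputs, and error handling is omitted):

#include <sys/mman.h>
#include <unistd.h>

long page = sysconf(_SC_PAGESIZE);
off_t  map_off = rec_off & ~(off_t)(page - 1);   // round down to a page boundary
size_t map_len = (rec_off - map_off) + rec_len;  // mmap rounds the length up
                                                 // to a whole page internally

char *base   = (char *)mmap(NULL, map_len, PROT_READ, MAP_PRIVATE, fd, map_off);
char *record = base + (rec_off - map_off);       // first byte of the record

// ... process rec_len bytes at 'record' ...

munmap(base, map_len);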

This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).

mlbrock
+8  A: 

The main performance cost is going to be disk I/O. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk I/O will dominate your run-times.

I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.

I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring two system calls per record and possibly losing the page out of the disk-memory cache.

In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:

  1. There is random access (not sequential) within the file, AND
  2. the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
  3. OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.

(btw - I love mmap()/MapViewOfFile()).

Tim Cooper
+1  A: 

I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counterexample be somewhat optimized?

Ben Collins wrote:

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
    in.read(data, 0x1000);
    // do something with data 
}

I would suggest also trying:

char data[0x1000];
std::ifstream ifile("file.bin");
std::istream  in(ifile.rdbuf());

while (in)
{
    in.read(data, 0x1000);
    // do something with data
}

And beyond that, you might also try making the buffer the same size as one page of virtual memory, in case 0x1000 is not the page size on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
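Querying the page size is a one-liner on POSIX systems (a small illustrative fragment):

#include <unistd.h>
#include <vector>

long page_size = sysconf(_SC_PAGESIZE);   // typically 4096 on x86
std::vector<char> data(page_size);        // buffer sized to one VM page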

ceretullis
A: 

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

mike
A: 

I think the greatest thing about mmap is the potential for asynchronous reading with:

    /* prime the pipeline: map the first chunk before the loop so
       addr1 is valid on the first iteration */
    r = min(MMAP_SIZE, size_left);
    addr1 = mmap(NULL, r, PROT_READ, MAP_FLAGS, fd, pos);

    while( size_left > 0 ) {
        prev = r;
        size_left -= r;
        pos += r;
        addr2 = NULL;
        if( size_left > 0 ) {
            r = min(MMAP_SIZE, size_left);
            /* map the next chunk so the kernel can start faulting
               it in while we process the current one */
            addr2 = mmap(NULL, r, PROT_READ, MAP_FLAGS, fd, pos);
        }
        feed_data(ctx, addr1, prev);
        munmap(addr1, prev);
        addr1 = addr2;
    }

The problem is that I can't find the right MAP_FLAGS to hint that this memory should be synced from the file ASAP. I hope that MAP_POPULATE gives the right hint for mmap (i.e., it will not try to load all the contents before returning from the call, but will do so asynchronously, overlapping with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.
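As an aside (my suggestion, not something from the snippet above): madvise() can express a similar intent on an existing mapping:

#include <sys/mman.h>

/* hint that the range will be needed soon, so the kernel can start
   reading it in asynchronously */
madvise(addr2, r, MADV_WILLNEED);
/* and that access will be sequential, enabling aggressive readahead */
madvise(addr2, r, MADV_SEQUENTIAL);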

ony