Hi all, on a Linux box with plenty of memory (a few gigs), I need to randomly access a big file as fast as possible.

I was thinking of doing a cat myfile > /dev/null before accessing it, so the file's pages get pulled into memory sequentially, which should be faster than faulting them in through cold random access.

Does this approach make sense to you?

+4  A: 

Only one way to be sure that any (possibly premature?) optimization is worthwhile: benchmark it.
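For example, a minimal sketch of such a benchmark (not from this answer; the file name and iteration count are placeholders): time a batch of random 4 KB pread() calls, then compare a cold-cache run against a run taken after pre-reading the file.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        int fd = open("myfile", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        off_t blocks = st.st_size / 4096;
        if (blocks == 0) blocks = 1;

        char buf[4096];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 10000; i++) {    /* placeholder read count */
            off_t off = (rand() % blocks) * 4096;   /* random 4K-aligned offset */
            if (pread(fd, buf, sizeof buf, off) < 0) { perror("pread"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("10000 random 4K reads took %.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        close(fd);
        return 0;
    }

To make the cold run genuinely cold, drop the page cache between runs with echo 3 > /proc/sys/vm/drop_caches (as root).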

Paul Dixon
+2  A: 

It could theoretically speed up the access (especially if you access almost everything from the file), but I wouldn't bet on a big difference.

The only really useful approach is to benchmark it for your specific case.

Joachim Sauer
A: 

No, it doesn't. cat has to execute, and while you are waiting for that, your program could have been doing the real work.

anon
But the question is really: would a sequential scan prior to that work make the subsequent random accesses faster due to caching?
Paul Dixon
But the answer is the same. Doing the scan takes time.
anon
+6  A: 

While doing that may force the contents of the file into the system's cache, you are better off using posix_fadvise() (with the POSIX_FADV_WILLNEED advice) or the (blocking) readahead() call to make the kernel precache the data you will need.

EDIT: You might also want to try using the POSIX_FADV_RANDOM advice to disable readahead altogether. There's an article with a decent explanation of usage here: Advising the Linux Kernel on File I/O
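A minimal sketch of both calls (assuming Linux/glibc; the file name "myfile" is a placeholder, not from the answer):

    #define _GNU_SOURCE               /* for readahead() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("myfile", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Non-blocking hint: start pulling the whole file into the page cache. */
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

        /* Blocking alternative: returns once the range has been read in. */
        /* readahead(fd, 0, st.st_size); */

        /* Or disable kernel readahead entirely for truly random access: */
        /* posix_fadvise(fd, 0, st.st_size, POSIX_FADV_RANDOM); */

        /* ... do the random reads on fd here ... */
        close(fd);
        return 0;
    }

posix_fadvise() with POSIX_FADV_WILLNEED returns immediately and lets the kernel read in the background, while readahead() blocks until the requested range is in the page cache.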

Hasturkun
+2  A: 

As the others said, you'll need to benchmark it in your particular case.

It is quite possible it will result in a significant performance increase, though. On traditional rotating media (i.e. a hard disk), sequential access (via cat file > /dev/null or fadvise) is much faster than random access.

Kristof Provost
+1  A: 

If you really want the speed, I'd recommend trying memory-mapped I/O instead of trying to hack something up with cat. Of course, it depends on the size of the file you're trying to access and the type of access you want; this may not be possible.
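A minimal sketch of the memory-mapped approach (assuming POSIX mmap; the file name and the MADV_RANDOM hint are illustrative additions, not from the answer):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("myfile", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Map the whole file; pages are faulted in from the cache on demand. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell the kernel not to waste effort on readahead for this mapping. */
        madvise(p, st.st_size, MADV_RANDOM);

        /* Random access is now just pointer arithmetic, e.g.: */
        volatile char c = p[st.st_size / 2];
        (void)c;

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }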

readahead is a good call too...

Thomi
A: 

Doing "cat" on a big file might bring the data in and blow more valuable data out of the cache; this is not what you want.

If performance is at all important to you, you'll be doing regular performance testing anyway (and soak tests, etc.), so continue to do that and watch your graphs, figures, etc.

MarkR