I'm getting some strange performance results here and I'm hoping someone on stackoverflow.com can shed some light on this!

My goal was a program that I could use to test whether large seeks were more expensive than small seeks...

First, I created two files by dd'ing /dev/zero to separate files... One is 1 MB, the other is 9.8 GB... Then I wrote this code:

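#define _FILE_OFFSET_BITS 64  /* make off_t and st_size 64-bit so the 9.8 GB file works */

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main( int argc, char* argv[] )
{
  if( argc < 2 )
    {
      fprintf( stderr, "usage: %s <file>\n", argv[0] );
      return 1;
    }

  /* With _FILE_OFFSET_BITS 64, plain stat() already reports 64-bit sizes. */
  struct stat fileInfo;
  if( stat( argv[1], &fileInfo ) != 0 )
    {
      perror( "stat" );
      return 1;
    }

  FILE* inFile = fopen( argv[1], "r" );
  if( inFile == NULL )
    {
      perror( "fopen" );
      return 1;
    }

  for( int i = 0; i < 1000000; i++ )
    {
      /* random() % 100 yields only 100 distinct fractions: 0.00 through 0.99. */
      double seekFrac = ((double)(random() % 100)) / ((double)100);

      off_t seekOffset = (off_t)(seekFrac * fileInfo.st_size);

      fseeko( inFile, seekOffset, SEEK_SET );
    }

  fclose( inFile );
  return 0;
}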

Basically, this code does one million random seeks across the whole range of the file. When I run this under time, I get results like this for smallfile:

[developer@stinger ~]# time ./seeker ./smallfile

real    0m1.863s
user    0m0.504s
sys     0m1.358s

When I run it against the 9.8 gig file, I get results like this:

[developer@stinger ~]# time ./seeker ./bigfile

real    0m0.670s
user    0m0.337s
sys     0m0.333s

I ran against each file a couple dozen times and the results are consistent. Seeking in the large file is more than twice as fast as seeking in the small file. Why?

+13  A: 

You're not measuring disk performance, you're measuring how long it takes for fseek to set a pointer and return.

I recommend you do a file read from the location you're seeking to, if you want to test real IO.

Carl Smotricz
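The suggestion in this answer, following every seek with a read, could look roughly like the sketch below. The helper name seek_and_read and its return convention are made up for illustration; the offset selection is copied from the question's loop. It is only a sketch, and repeated runs will mostly hit the page cache rather than the disk.

```c
#define _XOPEN_SOURCE 700     /* expose fseeko() and random() under strict -std=c99 */
#define _FILE_OFFSET_BITS 64  /* 64-bit off_t so multi-gigabyte files work */

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper (not from the thread): seek to `count` random offsets
   in `path` and read one byte at each. Returns the number of bytes read,
   or -1 if the file is empty or cannot be stat'ed, opened, or seeked. */
long seek_and_read(const char *path, long count)
{
    struct stat info;
    if (stat(path, &info) != 0 || info.st_size == 0)
        return -1;

    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;

    long bytes_read = 0;
    for (long i = 0; i < count; i++) {
        /* Same offset selection as the question: one of 100 fractions, 0.00 to 0.99. */
        double frac = (double)(random() % 100) / 100.0;
        off_t offset = (off_t)(frac * (double)info.st_size);

        if (fseeko(in, offset, SEEK_SET) != 0) {
            fclose(in);
            return -1;
        }

        /* The read forces the data at the new position to be fetched. */
        if (getc(in) != EOF)
            bytes_read++;
    }

    fclose(in);
    return bytes_read;
}
```

Calling seek_and_read(argv[1], 1000000) from main and running the program under time gives numbers that include the reads, which is closer to what the question set out to measure.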
Wow... Ok, I added a getc() call after the seek to read a single character. Now, seeking in the large file is just slightly more expensive than seeking in the small file. Is there some optimization where multiple subsequent seeks are summed and actually done before the next IO? Wow...
dicroce
A seek() is just a hint to the operating system that you plan to read from somewhere next. The OS has a complicated scheduling mechanism for moving the disk heads so as to minimize total travel time for all users. Since your reads get interleaved with everybody else's, it makes no sense to move the heads until the last moment, when the OS (not your program, the OS!) actually does the read. So the OS keeps your seek position in the back of its mind but doesn't act on it until it physically reads the data.
Carl Smotricz
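One way to see that a seek on its own only records a position is to use the lseek() system call directly: on a regular file it accepts an offset far beyond the end of the file, and only the following read() finds that there is nothing there. The helper name seek_then_read below is hypothetical, and this sketch bypasses stdio, so it says nothing about any buffering fseeko() may do on top.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): seek to `target`, then try to
   read one byte. Stores the offset reported by lseek() in *reported.
   Returns 1 if a byte was read, 0 at or past end of file, -1 on error. */
int seek_then_read(const char *path, off_t target, off_t *reported)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* lseek() only updates the file offset kept for this descriptor. */
    *reported = lseek(fd, target, SEEK_SET);
    if (*reported == (off_t)-1) {
        close(fd);
        return -1;
    }

    /* The read is the first point at which file data is actually needed. */
    char byte;
    ssize_t n = read(fd, &byte, 1);

    close(fd);
    return (int)n;
}
```

For a 16-byte file, seeking to offset 1 GB still succeeds and reports that offset; the read that follows returns 0 because the position is past the end of the file.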
A: 

I would assume that it has to do with the implementation of fseeko.

The man page of fseek indicates that it merely "sets the file position indicator for the indicated stream." Since setting an integer should be independent of the file size, perhaps there is an "optimization" that will perform an automatic read (and cache the resulting information) after an fseek for small files and not large files.

advait