When I read a large file in the file system, can the cache improve the speed of the operation?

I think there are two different answers:

1. Yes. The cache can prefetch data, so performance improves.

2. No. Even though reading from the cache is much faster than reading from disk, the data still has to be read from the disk in the first place, so in the end the cache doesn't help: the overall speed is still the disk's read speed.

Which one is correct? How can I verify the answer?

[edit]

Another question:

What I am not sure about is this: when you turn on the cache, is the disk bandwidth used for

1. prefetching only, or

2. both prefetching and reading?

Whereas if you turn off the cache, the disk bandwidth is used only for reading.

If I turn off the cache and access the disk randomly, is the time needed comparable with the time to read sequentially with the cache turned on?

+3  A: 

1 is definitely correct. The operating system can fetch from the disk to the cache while your code is processing the data it's already received. Yes, the disk may well still be the bottleneck - but you won't have read, process, read, process, read, process, but read+process, read+process, read+process. For example, suppose we have processing which takes half the time of reading. Representing time going down the page, we might have this sort of activity without prefetching:

Read
Read
Process
Read
Read
Process
Read
Read
Process

Whereas with prefetching, this is optimised to:

Read
Read
Read     Process
Read
Read     Process
Read
         Process

Basically the total time will be "time to read whole file + time to process last piece of data" instead of "time to read whole file + time to process whole file".
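The arithmetic above can be sketched as a toy calculation (an illustration only; the 2:1 read-to-process ratio and the three chunks are just the numbers from the diagram):

```python
# Toy model of the timing argument: each chunk of the file takes
# 2 time units to read and 1 to process (processing is half of reading).
read_t, proc_t, chunks = 2.0, 1.0, 3

# Without prefetching, reads and processing strictly alternate.
no_prefetch = chunks * (read_t + proc_t)

# With prefetching, processing overlaps the next read, so only the
# processing of the last chunk sticks out past the final read.
with_prefetch = chunks * read_t + proc_t

print(no_prefetch, with_prefetch)  # 9.0 7.0
```

That matches the two diagrams: nine rows of activity without prefetching, seven with it.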

Testing it is tricky - you'll need to have an operating system where you can tweak or turn off the cache. Another alternative is to change how you're opening the file - for instance, in .NET if you open the file with FileOptions.SequentialScan the cache is more likely to do the right thing. Try with and without that option.
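One way to try this without OS-level knobs is to time a cold read against a warm read of the same file. This sketch assumes Linux and Python 3, where `os.posix_fadvise` is available; note that `POSIX_FADV_DONTNEED` is only a hint to the kernel, so eviction isn't guaranteed:

```python
import os
import tempfile
import time

def read_whole(path, chunk=1 << 20):
    """Sequentially read the whole file; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    return time.perf_counter() - start

def drop_from_cache(path):
    """Ask the kernel to evict this file's cached pages (advisory only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

# Scratch file to time; a bigger file shows the cold/warm gap more clearly.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20) * 64)  # ~64 MiB
    f.flush()
    os.fsync(f.fileno())               # pages must be clean to be dropped
    path = f.name

drop_from_cache(path)
cold = read_whole(path)                # (mostly) from disk
warm = read_whole(path)                # (mostly) from the page cache
print(f"cold {cold:.4f}s  warm {warm:.4f}s")
os.unlink(path)
```

From a root shell, `sync; echo 3 > /proc/sys/vm/drop_caches` is the heavier-handed way to empty the whole page cache between runs.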

This has spoken mostly about prefetching - general caching (keeping the data even after it's been delivered to the application) is a different matter, and obviously acts as a big win if you want to use the same data more than once. There's also "something in between" where the application has only requested a small amount of data, but the disk has read a whole block - the OS isn't actively prefetching blocks which haven't been requested, but can cache the whole block so that if the app then requests more data from the same block it can return that data from the cache.

Jon Skeet
Do you sleep? ;-)
Blank Xavier
Yes, but now I have to get my eldest son changed after swimming. Back later :)
Jon Skeet
@Jon Skeet: But is the time to process the last piece of data comparable with the time to prefetch the next piece of data? And another question: if I turn off the cache and access the disk randomly, is the time needed comparable with the time to read sequentially with the cache turned on?
MainID
@Jinx: That entirely depends what you're doing with the data! If you're doing some complex encryption, you may end up spending more time processing than reading. If you're just counting lines of text, the processing time will be smaller than the IO.
Jon Skeet
If you need to access the disk *randomly* (i.e. not reading a file from start to finish, but jumping around) then the cache is less likely to be able to help.
Jon Skeet
+3  A: 

First answer is correct.

The disk has a fixed underlying performance - but that fixed underlying performance differs in different circumstances. You obtain better real performance from a drive when you read long sections of data - e.g. when you read ahead into the cache. So caching permits the drive to achieve a genuine improvement in its real performance.

Blank Xavier
A: 

If the files are larger than your memory, then the cache definitely has no way of helping.

+1  A: 

Jon Skeet has a very interesting benchmark with .NET about this topic. The basic result was that prefetching helps more, the more processing you have to do per unit of data read.

David Schmitt
+2  A: 

In the general case, it will be faster with the cache. Some points to consider:

  • The data on the disk is organized in surfaces (aka heads), tracks and blocks. It takes the disk some time to position the reading heads so that you can start reading a track. Now you need five blocks from that track. Unfortunately, you ask for them in a different order than they appear on the physical media. The cache helps greatly by reading the whole track into memory (lots more blocks than you need), then reindexing them (when the head starts to read, it will probably be anywhere on the track, not at the start of the first block). Without this, you'd have to wait until the first block of the track rotates under the head before starting to read -> the time to read a track would effectively be doubled. So with a cache, you can read the blocks of a track in any order, and you start reading as soon as the head arrives over the track.

  • If the file system is pretty full, the OS will start to squeeze your data into various empty spaces. Imagine block 1 is on track 5, block 2 is on track 7, block 3 is again on track 5. Without a cache, you'd lose a lot of time positioning the head. With a cache, track 5 is read and kept in RAM as the head goes to track 7, and when you ask for block 3, you get it immediately.

  • Large files need a lot of meta-data, namely where the data blocks for the file are. In this case, the cache will keep this data live as you read the file, saving you from a lot more head thrashing.

  • The cache allows other programs to access their data efficiently while you hog the disk, so overall performance is better. This is very important when a second program starts to write while you read. In this case, the cache will collect some writes before it interrupts your reads. Also, most programs read data, process it and then write it back. Without the cache, a program would either get in its own way or it would have to implement its own caching scheme to avoid head thrashing.

  • A cache allows the OS to reorder the disk I/O. Say you have blocks on track 5, 7 and 13 but file order asks for track 5, 13 and then 7. Obviously, it's more efficient to read track 7 on the way to 13 rather than going all the way to 13 and then come back to 7.

So while, in theory, reading lots of data would be faster without a cache, this is only true if your file is the only one on the disk, all meta-data is ordered perfectly, the physical layout of the data is such that the reading heads always start reading a track at the start of the first block, etc.
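The reordering bullet above can be illustrated with a toy head-travel count (the track numbers 5, 13 and 7 are just the ones from the example):

```python
def head_travel(tracks, start=0):
    """Total distance the head moves visiting tracks in the given order."""
    pos, total = start, 0
    for t in tracks:
        total += abs(t - pos)
        pos = t
    return total

file_order = [5, 13, 7]        # order the file asks for its blocks
elevator = sorted(file_order)  # caching lets the OS visit tracks in order

print(head_travel(file_order))  # 5 + 8 + 6 = 19
print(head_travel(elevator))    # 5 + 2 + 6 = 13
```

Reading track 7 on the way out to 13 cuts the total seek distance, which is exactly what the cache makes possible.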

Aaron Digulla
A: 

Another point: chances are, frequently used files will already be in the cache before you even start to read one of them.

Ingo