ansaurus

Question

How to obtain good concurrent read performance from disk.

Answer 1

+4 A:

I'd like to add some further notes in my response. All other non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this final one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux for example degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefor not spending nearly all their time waiting on the disk to seek.

Our solution for Windows is to pass the FILEFLAGNO_BUFFERING flag to CreateFile and use large (~16MiB) reads in each call to ReadFile. This is suboptimal for several reasons:

Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).

(As a final remark. Does this explain why swapping under Windows is so hellish? Ie, Windows is incapable of doing IO to multiple files concurrently with any efficiency, so while swapping all other IO operations are forced to be disproportionately slow.)

Edit to add some further details for Will Dean:

Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:

Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with single drive.
A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
An HP workstation (AMD Opteron) with Windows XP (64-bit), and Windows Server 2003, and hardware RAID 5.
My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with single drive.
My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
My home Koolu PC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.

CPU usage on all of these systems was very low during the tests and anti-virus was disabled.

I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILEFLAGSEQUENTIAL_SCAN flag set. This flag didn't fix the problem.

pauldoo 2008-08-12 19:51:06

Answer 2

A:

I would create some kind of in memory thread safe lock. Each thread could wait on the lock until it was free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.

Stacey Richards 2008-08-12 20:04:13

Answer 3

+1 A:

It does seem a little strange that you see no difference across quite a wide range of windows versions and nothing between a single drive and hardware raid-5.

It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and the Raid5, was all this tried on the same machine - have you tried another machine? Is your CPU usage basically zero during this test?

What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.

Will Dean 2008-08-12 20:20:17

per the single-drive vs raid5: if reading sequential data from two large enough files, you can't avoid *all* the disk heads seeking back and forth; the stripe size is typically 16-128kiB, so to read 1MiB of data, you need all (or most) of the heads to seek there.

ΤΖΩΤΖΙΟΥ 2008-10-10 11:23:38

Answer 4

A:

Do you use IOCompletionPorts under Windows? Windows via C++ has an in-depth chapter on this subject and as luck would have it, it is also available on MSDN.

graham.reeds 2008-08-12 21:38:05

Answer 5

A:

Paul - saw the update. Very interesting.

It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.

My only suggestion about a different API would be to try memory mapping the files - have you tried that? Unfortunately at 2GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.

Will Dean 2008-08-13 09:44:26

Answer 6

A:

@graham.reeds

We tried asynchronous IO using the FILEFLAGOVERLAPPED flag to CreateFile, which did not fix the problem. We haven't tried the IOCompletionPorts API that you mention but in general I'm doubtful that any asynchronous API will fix the problem.

The asynchronous APIs are mainly useful for letting application threads do other non-IO related work while waiting on IO. They are not about changing the way in which the OS performs the physical IO (this is in contrast to FILEFLAGNO_BUFFERING).

pauldoo 2008-08-13 09:46:20

Answer 7

A:

@Will Dean

We tried memory mapped IO and it actually made things worse. Perhaps Windows assumes that the application will be doing random access when working with a memory mapped file and implements different readahead algorithms. Perhaps if there was an equivalent of madvise (or fadvise) this would work.

pauldoo 2008-08-13 09:50:14

Answer 8

+4 A:

The problem seems to be in Windows I/O scheduling policy. According to what I found here there are many ways for an O.S. to schedule disk requests. While Linux and others can choose between different policies, before Vista Windows was locked in a single policy: a FIFO queue, where all requests where splitted in 64 KB blocks. I believe that this policy is the cause for the problem you are experiencing: the scheduler will mix requests from the two threads, causing continuous seek between different areas of the disk.
Now, the good news is that according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum badwidth for your process.
The bad news is that I found no way to change disk policy or buffers size in previous versions of Windows. Also, even if raising disk I/O priority of your process will boost the performance against the other processes, you still have the problems of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):

if THREAD A is reading from disk then wait for THREAD A to stop reading or wait for X ms
Read for X ms (or Y MB)
Stop reading and check status of thread A again

You could use semaphores for status checking or you could use perfmon counters to get the status of the actual disk queue. The values of X and/or Y could also be auto-tuned by checking the actual trasfer rates and slowly modify them, thus maximizing the throughtput when the application runs on different machines and/or O.S. You could find that cache, memory or RAID levels affect them in a way or the other, but with auto-tuning you will always get the best performance in every scenario.

Andrea Bertani 2008-08-13 13:47:39

ansaurus

tags:

views:

answers:

How to obtain good concurrent read performance from disk.

related questions