I'm wondering whether reading files in parallel with PLINQ or Parallel can be faster. My code is as follows (.NET 4.0):

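using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// File.Read does not exist; File.ReadAllBytes stands in for "read the whole file".
public static void ReadFileParallel(List<string> fileName)
{
    Parallel.ForEach(fileName, file => File.ReadAllBytes(file));
}

public static void ReadFilePLINQ(List<string> fileName)
{
    // PLINQ's parallel counterpart to foreach is ForAll.
    fileName.AsParallel().ForAll(file => File.ReadAllBytes(file));
}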

The reason I ask is that I thought file reading is I/O bound, so reading in parallel won't help. Am I right?

A: 

There is an excellent PDF from MSFT which goes into detail regarding Parallel and threading possibilities.

It might help.

http://www.microsoft.com/downloads/details.aspx?FamilyID=86b3d32b-ad26-4bb8-a3ae-c1637026c3ee&displaylang=en

keyle
+2  A: 

It depends.

If your files were in different locations, on different network shares, or on different physical hard drives, then yes, parallel loading will probably help. If they're on a single spinning hard drive, reading the files in parallel will probably hurt your performance significantly due to the extra seek time that you will likely incur from these parallel reads.

If your files are on an SSD, you will probably get slightly less performance, but it would depend on how many files you're reading in parallel and what their sizes are. I imagine that at a certain file size threshold and number of parallel reads, performance will drop significantly. Hard to tell on that one without some experimentation.
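One way to run that experiment is to cap how many reads are in flight and try several values. The following is only a sketch: `BoundedReader` and `ReadAll` are hypothetical names, and `File.ReadAllBytes` is assumed to be the kind of read the question has in mind.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedReader
{
    // Reads every file with at most maxParallel reads running at once.
    // Returns the total number of bytes read.
    public static long ReadAll(IEnumerable<string> paths, int maxParallel)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        long totalBytes = 0;

        Parallel.ForEach(paths, options, path =>
        {
            byte[] data = File.ReadAllBytes(path);
            // Several threads update the counter, so add atomically.
            Interlocked.Add(ref totalBytes, data.Length);
        });

        return totalBytes;
    }
}
```

Calling `BoundedReader.ReadAll(files, 1)`, then with 2, 4 and 8, and timing each call would show where (or whether) throughput starts to drop on a given drive.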

Dave Markle
Those are reasonable criteria. In practice, I'd say measure it rather than guessing, though.
Steven Sudit
A: 

You'd think so, but that's not what measurements show. When file I/O has significant latency, particularly over networks, doing it in parallel can keep the pipe filled.

Steven Sudit
A: 

To a first approximation, it will help if the files are on different disks and make it slower otherwise (due to increased time spent seeking).

It might be slightly faster if all the files are cached (since you can use multiple cores).

Your best bet, of course, is to run some benchmarks.
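A rough benchmark along those lines might look like the sketch below. `ReadBenchmark`, `Time` and `Compare` are hypothetical names. Keep in mind that the operating system may cache files after the first pass, so whichever variant runs second can look faster than it really is; use fresh files or swap the order between runs.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

public static class ReadBenchmark
{
    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    // Times a sequential pass and a parallel pass over the same files.
    // Caveat: the second pass may be served from the OS file cache.
    public static void Compare(IList<string> paths)
    {
        TimeSpan sequential = Time(() =>
        {
            foreach (string path in paths)
            {
                File.ReadAllBytes(path);
            }
        });

        TimeSpan parallel = Time(() =>
        {
            Parallel.ForEach(paths, path => { File.ReadAllBytes(path); });
        });

        Console.WriteLine("sequential: {0:F1} ms", sequential.TotalMilliseconds);
        Console.WriteLine("parallel:   {0:F1} ms", parallel.TotalMilliseconds);
    }
}
```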

tc.
A: 

You are not exactly doing a parallel File.Read; you are doing multiple File.Reads in parallel. If the files are on different spindles, you will get better throughput simply by using several spindles at once.

You can also get better performance on a single spindle if each Read is followed by CPU-bound processing, although in that case it would be much better to create and schedule Task objects. That way some tasks load data from files while others run the heavy processing on data that is already loaded.
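A minimal sketch of that arrangement, assuming a single reader task feeding several worker tasks through a `BlockingCollection` (available in .NET 4.0). The names `ReadThenProcess`, `Process` and `Run` are hypothetical, and `Process` is only a placeholder for real CPU-bound work.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ReadThenProcess
{
    // Placeholder for CPU-bound work: sums the bytes of one file.
    public static long Process(byte[] data)
    {
        long sum = 0;
        foreach (byte b in data) sum += b;
        return sum;
    }

    // One task reads files one after another; workerCount tasks process them.
    public static long Run(IEnumerable<string> paths, int workerCount)
    {
        long total = 0;
        // Bounded queue so the reader cannot run far ahead of the workers.
        using (var queue = new BlockingCollection<byte[]>(4))
        {
            Task reader = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (string path in paths)
                    {
                        queue.Add(File.ReadAllBytes(path));
                    }
                }
                finally
                {
                    // Always signal completion so the workers can exit.
                    queue.CompleteAdding();
                }
            }, TaskCreationOptions.LongRunning);

            var workers = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (byte[] data in queue.GetConsumingEnumerable())
                    {
                        Interlocked.Add(ref total, Process(data));
                    }
                });
            }

            Task.WaitAll(workers);
            reader.Wait(); // rethrows any read error as an AggregateException
        }
        return total;
    }
}
```

Because only one task touches the disk, reads stay sequential, while the processing spreads across cores.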

Panagiotis Kanavos
Yeah, but if his files are on the same HDD, he'll pay the head seek time, and throughput will drop by much more than a factor of 2. Remember that the average seek time for a 3.5" 7200 RPM drive is 13-15 milliseconds. And unlike capacity and linear read/write rate, this figure has stayed about the same over the last several years.
Soonts
That's why I said "each read followed by CPU-bound processing". While one thread is reading the file, another is doing processing, thus keeping both of them working.
Panagiotis Kanavos
A: 

I think that you've pretty much hit the nail on the head here.

Parallel operations in general are always throttled by the point at which you run out of resources to run them on, and even before that point you get diminishing returns as the number of parallel threads increases.

Jeff Atwood tweeted an interesting graph, which I will add to this later, showing the diminishing returns of multi-core processors in a multi-threaded environment. Granted, this is not exactly the same. But consider that even if you had 100 files on 100 hard drives, somewhere that I/O is getting piped back down a single channel, which will eat into the gain in read speed.

What I'm basically trying to say is that just running something in parallel doesn't mean it will be sped up dramatically. It's important to consider how the parallel processes are actually being executed.

msarchet
A: 

It is tricky business. If you do it wrong, the disk head moves back and forth trying to read two files at the same time. This is especially a problem with larger files.

However, if you read a lot of small files in parallel, you may gain a little because the disk subsystem can choose to read the files in a different order than you asked. I have not seen this effect in real life, however.

Also, any processing you do on the content can run in parallel with the reading of the files. So you need to profile and benchmark before you ship.

jdv