views: 76
answers: 3

I need to find a way to read a large number of small files (about 300k files) as fast as possible.

Reading them sequentially with FileStream, reading each entire file in a single call, takes between 170 and 208 seconds (re-running changes the numbers because the disk cache plays its part).
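For reference, a minimal sketch of that sequential approach (the folder path is just a placeholder):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class SequentialReader
    {
        // Reads every file under 'root' in one pass, one file at a time.
        // Returns a map of file path -> raw bytes.
        static Dictionary<string, byte[]> ReadAll(string root)
        {
            var result = new Dictionary<string, byte[]>();
            foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
            {
                // File.ReadAllBytes opens a FileStream and reads the whole file in a single call.
                result[path] = File.ReadAllBytes(path);
            }
            return result;
        }

        static void Main()
        {
            var sw = System.Diagnostics.Stopwatch.StartNew();
            var files = ReadAll(@"C:\data\smallfiles");   // placeholder folder
            Console.WriteLine("Read {0} files in {1:F1} s", files.Count, sw.Elapsed.TotalSeconds);
        }
    }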

Then I tried P/Invoke with CreateFile/ReadFile, passing FILE_FLAG_SEQUENTIAL_SCAN, but I didn't notice any improvement.
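(As an aside, FileStream can pass the same hint without P/Invoke via FileOptions.SequentialScan. A minimal sketch; the buffer size is just a placeholder:)

    using System.IO;

    // Opens the file with the same hint the native call passes (FILE_FLAG_SEQUENTIAL_SCAN),
    // but through the managed API.
    static byte[] ReadWithSequentialScan(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, FileOptions.SequentialScan))
        {
            var buffer = new byte[fs.Length];
            int read = 0;
            while (read < buffer.Length)
            {
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;   // unexpected end of file
                read += n;
            }
            return buffer;
        }
    }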

I also tried several threads (dividing the set into chunks and having each thread read its own part), but that improved speed only slightly (less than 5% for each additional thread, up to 4); a sketch of what I mean is below.
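A sketch of that chunked, multi-threaded variant (the chunking scheme and the choice of a concurrent collection are just one possible arrangement):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading;

    class ChunkedReader
    {
        // Splits the file list into one contiguous chunk per thread and reads each chunk
        // on its own thread. Results go into a thread-safe dictionary.
        static ConcurrentDictionary<string, byte[]> ReadInChunks(string[] paths, int threadCount)
        {
            var result = new ConcurrentDictionary<string, byte[]>();
            var threads = new Thread[threadCount];
            int chunkSize = (paths.Length + threadCount - 1) / threadCount;

            for (int t = 0; t < threadCount; t++)
            {
                int start = t * chunkSize;
                int end = Math.Min(start + chunkSize, paths.Length);
                threads[t] = new Thread(() =>
                {
                    for (int i = start; i < end; i++)
                        result[paths[i]] = File.ReadAllBytes(paths[i]);
                });
                threads[t].Start();
            }

            foreach (var thread in threads)
                thread.Join();

            return result;
        }
    }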

Any ideas on how to find the most effective way to do this?

A: 

My guess is that you're going to be constrained by the low-level file access code, physical disk activity, etc. Multiple threads could end up just thrashing the disk. How much control do you have over where these files are and what happens when they are created?

Could you arrange for them to be on a solid-state disk rather than a spinning disk?

Could you load the data into a database as it arrives? Then your searches would run against a (possibly indexed) database.
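If that's an option, something along the lines of this sketch could ingest files as they land (the folder is hypothetical, the database insert is left as a placeholder, and a real version would need to wait until the writer has finished with each file before reading it):

    using System;
    using System.IO;

    class IngestOnArrival
    {
        static void Main()
        {
            // Watch the drop folder and ingest each file as soon as it is created,
            // so the application never has to scan 300k files at startup.
            var watcher = new FileSystemWatcher(@"C:\data\smallfiles")   // hypothetical folder
            {
                IncludeSubdirectories = true
            };

            watcher.Created += (sender, e) =>
            {
                byte[] content = File.ReadAllBytes(e.FullPath);
                // Insert 'content' into the database here (e.g. an indexed table keyed by path).
                Console.WriteLine("Ingested {0} ({1} bytes)", e.FullPath, content.Length);
            };

            watcher.EnableRaisingEvents = true;

            Console.WriteLine("Watching... press Enter to stop.");
            Console.ReadLine();
        }
    }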

djna
+1  A: 

As @djna has told you, your disk is probably only capable of servicing one request at a time, so multiple threads in your program won't help and may actually make things worse. The variance in execution time for the single-threaded version of your code is already well in excess of the time saved by multi-threading; in other words, the apparent improvement is not statistically significant.

One option that you might consider is moving to a parallel I/O system designed for multi-threaded access. This is a big step, however, and only worth taking if you do this sort of operation regularly.

Another option would be to distribute the files across local disks on networked systems and have each system work through a portion of the files. How easy that would be for you to implement is hard to say; you haven't told us enough about your setup for more specific advice, but it's worth thinking about.

High Performance Mark
A: 

I would load all the files once and save them as one big file. Your application can then load just that one file, scan the 300k files for only those that have changed (by size, modified date, or being deleted/added), and apply those changes to the in-memory big file.

You said they were small files, so I assume the 300k files can all be loaded at once; if not, you presumably only need a subset of the original 300k files anyway, so the big file can just be that subset.

The only way this wouldn't work is if something else is writing the 300k files every time your application runs and that sounds unlikely.
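A rough sketch of the change-detection part, assuming an in-memory cache (persisting the combined data to a single file on disk between runs is left out):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class BigFileCache
    {
        // One cached entry per small file: the bytes plus the metadata used to detect changes.
        class Entry
        {
            public byte[] Content;
            public long Length;
            public DateTime LastWriteUtc;
        }

        readonly Dictionary<string, Entry> _cache = new Dictionary<string, Entry>();

        // Re-reads only the files whose size or timestamp changed since the last refresh,
        // and drops entries for files that no longer exist.
        public void Refresh(string root)
        {
            var seen = new HashSet<string>();
            foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
            {
                seen.Add(path);
                var info = new FileInfo(path);
                Entry entry;
                if (!_cache.TryGetValue(path, out entry) ||
                    entry.Length != info.Length ||
                    entry.LastWriteUtc != info.LastWriteTimeUtc)
                {
                    _cache[path] = new Entry
                    {
                        Content = File.ReadAllBytes(path),
                        Length = info.Length,
                        LastWriteUtc = info.LastWriteTimeUtc
                    };
                }
            }

            // Remove entries for files that have been deleted.
            var removed = new List<string>();
            foreach (var key in _cache.Keys)
                if (!seen.Contains(key)) removed.Add(key);
            foreach (var key in removed)
                _cache.Remove(key);
        }
    }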

Enigmativity