tags:
views: 1094
answers: 9

I have some directories containing test data, typically over 200,000 small (~4k) files per directory.

I am using the following C# code to get the number of files in a directory:

int fileCount = System.IO.Directory.GetFiles(@"C:\SomeDirectory").Length;

This is very, very slow, however. Are there any alternatives that I can use?

Edit

Each folder contains data for one day, and we will have around 18 months of directories (~550 directories). I am also very interested in performance enhancements people have found by reworking flat directory structures to more nested ones.

+4  A: 

You could use System.Management and WMI's CIM_DataFile class; just run the following query in WMI (you could also use LINQ to WMI, but I haven't tried it):

select * from cim_datafile where drive='c:' and path='\\SomeDirectory\\'

I guess it will work faster
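For illustration, a minimal C# sketch of running that query with System.Management might look like this (the path is the one from the question; note that CIM_DataFile queries can themselves be slow on large directories, so measure before relying on it):

using System;
using System.Management; // add a reference to System.Management.dll

class WmiFileCount
{
    static void Main()
    {
        // Count the rows returned by the CIM_DataFile query; WQL requires the
        // doubled backslashes in the Path filter.
        string query = @"select * from CIM_DataFile where Drive = 'c:' and Path = '\\SomeDirectory\\'";

        using (var searcher = new ManagementObjectSearcher(query))
        {
            Console.WriteLine(searcher.Get().Count);
        }
    }
}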

ArsenMkrt
Do you have any metrics for the above?
Brian Agnew
No I don't, but I think this will work faster than the code Brian provides, because this uses the Windows API. I will find time to measure.
ArsenMkrt
Which code? Who?
Brian Agnew
oops sorry Richard E
ArsenMkrt
Ah. I would have thought that a .NET call for something like this is going to be pretty efficient, and most likely bounded by I/O performance. However, the only way to tell is to measure.
Brian Agnew
+4  A: 

The file system is not designed for this layout. You'll have to reorganize it (to have fewer files per folder) if you want to work on that performance problem.

280Z28
What would you recommend as a maximum number of files per folder?
Richard Ev
Look at how Internet Explorer organizes Temporary Internet Files. That's a large number of little files, but they're not all in one folder.
John Saunders
+2  A: 

Not with the System.IO.Directory class, there isn't. You'll have to find a way of querying the directory that doesn't involve creating a massive list of files.

This seems like a bit of an oversight from Microsoft; the Win32 APIs have always had functions that could count files in a directory.

You may also want to consider splitting up your directory. How you manage a 200,000-file directory is beyond me :-)

Update:

John Saunders raises a good point in the comments. We already know that (general purpose) file systems are not well equipped to handle this level of storage. One thing that is equipped to handle huge numbers of small "files" is a database.

If you can identify a key for each file (containing, for example, date, hour and customer number), these files should be injected into a database. The 4K record size and 108 million rows (200,000 rows/day * 30 days/month * 18 months) should be easily handled by most professional databases. I know that DB2/z would chew on that for breakfast.

Then, when you need some test data extracted to files, you have a script/program which just extracts the relevant records onto the file system. Then run your tests to successful completion and delete the files.

That should make your specific problem quite easy to do:

select count(*) from test_files where directory_name = '/SomeDirectory'

assuming you have an index on directory_name, of course.
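If the data did end up in a database, getting the count from C# is then a single scalar query. A rough sketch (assuming SQL Server via ADO.NET; the connection string, table and column names are placeholders, not an existing schema):

using System;
using System.Data.SqlClient;

class DbFileCount
{
    static void Main()
    {
        // Placeholder connection string and schema - adjust to the real database.
        using (var connection = new SqlConnection(@"Data Source=.;Initial Catalog=TestData;Integrated Security=true"))
        using (var command = new SqlCommand(
            "select count(*) from test_files where directory_name = @dir", connection))
        {
            command.Parameters.AddWithValue("@dir", "/SomeDirectory");
            connection.Open();
            int fileCount = (int)command.ExecuteScalar();
            Console.WriteLine(fileCount);
        }
    }
}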

paxdiablo
The Win32 API has methods to enumerate the file system entries, but not to simply get the count. The `GetFiles` method the OP is using is implemented by calling the Win32 enumeration methods.
280Z28
@Pax - the files in question represent daily transaction data, ~200,000 per day which we have to have available for 18 months. We will have to subsequently access around 3% of all files based on customer enquiries. That's as far as the management goes...
Richard Ev
Why not use a database?
John Saunders
@280Z28, you're right, it was always a matter of doing a findfirst/findnext type of operation. However, at no point was it necessary to have the entire list of files in memory at once, just the current one being processed. While the code in the question may use those APIs under the hood, it uses them to get a list and then calculates the length of the list. This is inefficient, but I can see no way of doing a first/next with the System.IO.Directory stuff.
paxdiablo
+3  A: 

The code you've got is slow because it first gets an array of all the available files, then takes the length of that array.

However, you're almost certainly not going to find any solutions that work much faster than that.

Why?

Access controls.

Each file in a directory may have an access control list - which may prevent you from seeing the file at all.

The operating system itself can't just say "hey, there are 100 file entries here" because some of them may represent files you're not allowed to know exist - they shouldn't be shown to you at all. So the OS itself has to iterate over the files, checking access permissions file by file.

For a discussion that goes into more detail around this kind of thing, see two posts from The Old New Thing:

[As an aside, if you want to improve the performance of a directory containing a lot of files, limit yourself to strictly 8.3 filenames. No, I'm not kidding - it's faster, because the OS doesn't have to generate an 8.3 filename itself, and because the algorithm used is braindead. Try a benchmark and you'll see.]

Bevan
8.3 filename creation under NTFS is one of the things we're planning to disable on the production system.
Richard Ev
+1 for making the point about access controls
Richard Ev
+5  A: 

I had a very similar problem with a directory containing (we think) ~300,000 files.

After messing with lots of methods for speeding up access (all unsuccessful), we solved our access problems by reorganising the directory into something more hierarchical.

We did this by creating directories a-z, representing the first letter of the filename, then sub-directories under each of those, also a-z, for the second letter of the filename. Then we inserted each file into the related directory

e.g.

gbp32.dat

went in

g/b/gbp32.dat

and re-wrote our file access routines appropriately. This made a massive difference, and it's relatively trivial to do (I think we moved each file using a 10-line Perl script)
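The move itself was a short Perl script; a rough C# equivalent of the same idea (the root path is hypothetical, and filenames shorter than two characters would need their own rule) might be:

using System.IO;

class Reorganise
{
    static void Main()
    {
        const string root = @"C:\SomeDirectory"; // hypothetical flat directory

        foreach (string file in Directory.GetFiles(root))
        {
            string name = Path.GetFileName(file);
            if (name.Length < 2)
                continue; // too short to bucket by two letters

            // e.g. gbp32.dat -> g\b\gbp32.dat
            string target = Path.Combine(Path.Combine(root, name.Substring(0, 1)), name.Substring(1, 1));
            Directory.CreateDirectory(target); // no-op if it already exists
            File.Move(file, Path.Combine(target, name));
        }
    }
}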

Brian Agnew
At the moment we are considering placing 24 hourly folders inside our daily folders for this very reason.
Richard Ev
That sounds like it would help. It's worth knocking up a dummy hierarchy and seeing how easy it is to traverse. That's what we did, and once we'd moved stuff we never had any more problems.
Brian Agnew
We've done this (a folder per hour with date) and it works very well for us
Binary Worrier
+1  A: 

Create an index every day at midnight. Finding a file will then be very fast, and counting the number of files is just as trivial.

If I understand it right, you have one directory for each day. If all the files you receive today go into today's folder, then this system can be improved: just index the previous day's directory at midnight.
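A very rough sketch of that idea (the root path and the yyyy-MM-dd folder naming are assumptions - adjust to however the daily directories are actually named):

using System;
using System.IO;

class NightlyIndexer
{
    static void Main()
    {
        const string root = @"C:\TestData"; // hypothetical root of the daily folders
        string yesterday = DateTime.Today.AddDays(-1).ToString("yyyy-MM-dd");
        string folder = Path.Combine(root, yesterday);

        // Pay for the slow enumeration once, at midnight; after that the count
        // (and lookups) come from the index file rather than the directory.
        string[] files = Directory.GetFiles(folder);
        File.WriteAllLines(Path.Combine(folder, "index.txt"), files);
        Console.WriteLine("{0} files indexed for {1}", files.Length, yesterday);
    }
}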

Carra
+3  A: 

FYI, .NET 4 includes a new method, Directory.EnumerateFiles, that does exactly what you need and is awesome. Chances are you're not using .NET 4, but it's worth remembering anyway!

Edit: I now realise that the OP wanted the NUMBER of files. However, this method is so useful I'm keeping this post here.
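For what it's worth, the count does fall out of it naturally on .NET 4, because the enumeration is lazy - a minimal sketch:

using System;
using System.IO;
using System.Linq;

class FileCounter
{
    static void Main()
    {
        // EnumerateFiles streams entries one at a time; Count() walks them
        // without ever holding 200,000 paths in a single array.
        int fileCount = Directory.EnumerateFiles(@"C:\SomeDirectory").Count();
        Console.WriteLine(fileCount);
    }
}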

Richard Szalay
We're using .NET 3.5 :-)
Richard Ev
+1  A: 

If you are not afraid of calling Win32 functions, it might be worth trying FindFirstFile and then iterating with FindNextFile. This saves the overhead of allocating all those strings just to get a count.
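Something along these lines, as a sketch only - the P/Invoke declarations are the standard kernel32 ones, and error handling is kept to a minimum:

using System;
using System.IO;
using System.Runtime.InteropServices;

static class Win32FileCount
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    public static int CountFiles(string directory)
    {
        int count = 0;
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            return 0; // empty or inaccessible directory

        try
        {
            do
            {
                // Skip sub-directories (including "." and "..") so only files are counted.
                if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                    count++;
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }

        return count;
    }
}

Calling Win32FileCount.CountFiles(@"C:\SomeDirectory") then counts the entries one at a time, reusing a single WIN32_FIND_DATA buffer instead of building an array.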

Dolphin
A: 

If I were using a slowish high-level language and portability wasn't a big concern, I'd be tempted to try calling an external program (e.g. `ls | wc`.first.to_i if using Ruby and Unix), but then I'd check whether it actually does the job any better.

Andrew Grimm