tags:
views: 1094
answers: 9

I have some directories containing test data, typically over 200,000 small (~4k) files per directory.

I am using the following C# code to get the number of files in a directory:

int fileCount = System.IO.Directory.GetFiles(@"C:\SomeDirectory").Length;

This is very, very slow, however. Are there any alternatives that I can use?

Edit

Each folder contains data for one day, and we will have around 18 months of directories (~550 directories). I am also very interested in performance enhancements people have found by reworking flat directory structures to more nested ones.

+4  A: 

You could use System.Management and WMI's CIM_DataFile class; just run the following query in WMI (you could also use LINQ to WMI, but I haven't tried it):

select * from cim_datafile where drive='c:' and path='\\SomeDirectory\\'

I guess it will work faster
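For illustration, a minimal C# sketch of running that query with System.Management might look like this (the path is the one from the question; note that CIM_DataFile queries can themselves be slow on large directories, so measure before relying on it):

using System;
using System.Management; // add a reference to System.Management.dll

class WmiFileCount
{
    static void Main()
    {
        // Count the rows returned by the CIM_DataFile query; WQL requires the
        // doubled backslashes in the Path filter.
        string query = @"select * from CIM_DataFile where Drive = 'c:' and Path = '\\SomeDirectory\\'";

        using (var searcher = new ManagementObjectSearcher(query))
        {
            Console.WriteLine(searcher.Get().Count);
        }
    }
}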

ArsenMkrt
Do you have any metrics for the above?
Brian Agnew
No I don't, but I think this will work faster than the code Brian provides, because this uses the Windows API. I will find time to measure.
ArsenMkrt
Which code? Who?
Brian Agnew
oops sorry Richard E
ArsenMkrt
Ah. I would have thought that a .NET call for something like this is going to be pretty efficient, and most likely bounded by I/O performance. However, the only way to tell is to measure.
Brian Agnew
+4  A: 

The file system is not designed for this layout. You'll have to reorganize it (to have fewer files per folder) if you want to work on that performance problem.

280Z28
What would you recommend as a maximum number of files per folder?
Richard Ev
Look at how Internet Explorer organizes Temporary Internet Files. That's a large number of little files, but they're not all in one folder.
John Saunders
+2  A: 

Not with the System.IO.Directory class, there isn't. You'll have to find a way of querying the directory that doesn't involve creating a massive list of files.

This seems like a bit of an oversight from Microsoft; the Win32 APIs have always had functions that could count files in a directory.

You may also want to consider splitting up your directory. How you manage a 200,000-file directory is beyond me :-)

Update:

John Saunders raises a good point in the comments. We already know that (general purpose) file systems are not well equipped to handle this level of storage. One thing that is equipped to handle huge numbers of small "files" is a database.

If you can identify a key for each file (containing, for example, date, hour and customer number), these files should be injected into a database. The 4K record size and 108 million rows (200,000 rows/day * 30 days/month * 18 months) should be easily handled by most professional databases. I know that DB2/z would chew on that for breakfast.

Then, when you need some test data extracted to files, you have a script/program which just extracts the relevant records onto the file system. Then run your tests to successful completion and delete the files.

That should make your specific problem quite easy to do:

select count(*) from test_files where directory_name = '/SomeDirectory'

assuming you have an index on directory_name, of course.
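If the data did end up in a database, getting the count from C# is then a single scalar query. A rough sketch (assuming SQL Server via ADO.NET; the connection string, table and column names are placeholders, not an existing schema):

using System;
using System.Data.SqlClient;

class DbFileCount
{
    static void Main()
    {
        // Placeholder connection string and schema - adjust to the real database.
        using (var connection = new SqlConnection(@"Data Source=.;Initial Catalog=TestData;Integrated Security=true"))
        using (var command = new SqlCommand(
            "select count(*) from test_files where directory_name = @dir", connection))
        {
            command.Parameters.AddWithValue("@dir", "/SomeDirectory");
            connection.Open();
            int fileCount = (int)command.ExecuteScalar();
            Console.WriteLine(fileCount);
        }
    }
}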

paxdiablo
The Win32 API has methods to enumerate the file system entries, but not to simply get the count. The `GetFiles` method the OP is using is implemented by calling the Win32 enumeration methods.
280Z28
@Pax - the files in question represent daily transaction data, ~200,000 per day which we have to have available for 18 months. We will have to subsequently access around 3% of all files based on customer enquiries. That's as far as the management goes...
Richard Ev
Why not use a database?
John Saunders
@280Z28, you're right, it was always a matter of doing a findfirst/findnext type of operation. However, at no point was it necessary to have the entire list of files in memory at once, just the current one being processed. While the code in the question may use those APIs under the hood, it uses them to get a list and then calculates the length of the list. This is inefficient, but I can see no way of doing a first/next with the System.IO.Directory stuff.
paxdiablo
+3  A: 

The code you've got is slow because it first gets an array of all the available files, then takes the length of that array.

However, you're almost certainly not going to find any solutions that work much faster than that.

Why?

Access controls.

Each file in a directory may have an access control list - which may prevent you from seeing the file at all.

The operating system itself can't just say "hey, there are 100 file entries here" because some of them may represent files you're not allowed to know exist - they shouldn't be shown to you at all. So the OS itself has to iterate over the files, checking access permissions file by file.

For a discussion that goes into more detail around this kind of thing, see two posts from The Old New Thing:

[As an aside, if you want to improve the performance of a directory containing a lot of files, limit yourself to strictly 8.3 filenames. No, I'm not kidding - it's faster, because the OS doesn't have to generate an 8.3 filename itself, and because the algorithm used is braindead. Try a benchmark and you'll see.]

Bevan
8.3 filename creation under NTFS is one of the things we're planning to disable on the production system.
Richard Ev
+1 for making the point about access controls
Richard Ev
+5  A: 

I had a very similar problem with a directory containing (we think) ~300,000 files.

After messing with lots of methods for speeding up access (all unsuccessful), we solved our access problems by reorganising the directory into something more hierarchical.

We did this by creating directories a-z, representing the first letter of the filename, then sub-directories under each of those, also a-z, for the second letter of the filename. Then we inserted each file into the related directory

e.g.

gbp32.dat

went in

g/b/gbp32.dat

and re-wrote our file access routines appropriately. This made a massive difference, and it's relatively trivial to do (I think we moved each file using a 10-line Perl script)
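The move itself was a short Perl script; a rough C# equivalent of the same idea (the root path is hypothetical, and filenames shorter than two characters would need their own rule) might be:

using System.IO;

class Reorganise
{
    static void Main()
    {
        const string root = @"C:\SomeDirectory"; // hypothetical flat directory

        foreach (string file in Directory.GetFiles(root))
        {
            string name = Path.GetFileName(file);
            if (name.Length < 2)
                continue; // too short to bucket by two letters

            // e.g. gbp32.dat -> g\b\gbp32.dat
            string target = Path.Combine(Path.Combine(root, name.Substring(0, 1)), name.Substring(1, 1));
            Directory.CreateDirectory(target); // no-op if it already exists
            File.Move(file, Path.Combine(target, name));
        }
    }
}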

Brian Agnew
At the moment we are considering placing 24 hourly folders inside our daily folders for this very reason.
Richard Ev
That sounds like it would help. It's worth knocking up a dummy hierarchy and seeing how easy it is to traverse. That's what we did, and once we'd moved stuff we never had any more problems.
Brian Agnew
We've done this (a folder per hour with date) and it works very well for us
Binary Worrier
+1  A: 

Create an index every day at midnight. Finding a file will then be very fast, and counting the number of files is just as trivial.

If I understand it right, you have one directory for each day. If all the files you receive today go into today's folder, then this system can be improved: just index the previous day's directory at midnight.
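A very rough sketch of that idea (the root path and the yyyy-MM-dd folder naming are assumptions - adjust to however the daily directories are actually named):

using System;
using System.IO;

class NightlyIndexer
{
    static void Main()
    {
        const string root = @"C:\TestData"; // hypothetical root of the daily folders
        string yesterday = DateTime.Today.AddDays(-1).ToString("yyyy-MM-dd");
        string folder = Path.Combine(root, yesterday);

        // Pay for the slow enumeration once, at midnight; after that the count
        // (and lookups) come from the index file rather than the directory.
        string[] files = Directory.GetFiles(folder);
        File.WriteAllLines(Path.Combine(folder, "index.txt"), files);
        Console.WriteLine("{0} files indexed for {1}", files.Length, yesterday);
    }
}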

Carra
+3  A: 

FYI, .NET 4 includes a new method, Directory.EnumerateFiles, that does exactly what you need and is awesome. Chances are you're not using .NET 4, but it's worth remembering anyway!

Edit: I now realise that the OP wanted the NUMBER of files. However, this method is so useful I'm keeping this post here.
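For what it's worth, the count does fall out of it naturally on .NET 4, because the enumeration is lazy - a minimal sketch:

using System;
using System.IO;
using System.Linq;

class FileCounter
{
    static void Main()
    {
        // EnumerateFiles streams entries one at a time; Count() walks them
        // without ever holding 200,000 paths in a single array.
        int fileCount = Directory.EnumerateFiles(@"C:\SomeDirectory").Count();
        Console.WriteLine(fileCount);
    }
}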

Richard Szalay
We're using .NET 3.5 :-)
Richard Ev
+1  A: 

If you are not afraid of calling Win32 functions, it might be worth trying FindFirstFile and then iterating with FindNextFile. This saves the overhead of allocating all those strings just to get a count.
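Something along these lines, as a sketch only - the P/Invoke declarations are the standard kernel32 ones, and error handling is kept to a minimum:

using System;
using System.IO;
using System.Runtime.InteropServices;

static class Win32FileCount
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    public static int CountFiles(string directory)
    {
        int count = 0;
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            return 0; // empty or inaccessible directory

        try
        {
            do
            {
                // Skip sub-directories (including "." and "..") so only files are counted.
                if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                    count++;
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }

        return count;
    }
}

Calling Win32FileCount.CountFiles(@"C:\SomeDirectory") then counts the entries one at a time, reusing a single WIN32_FIND_DATA buffer instead of building an array.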

Dolphin
A: 

If I were using a slowish high-level language and portability wasn't a big concern, I'd be tempted to try calling an external program (e.g. `ls | wc`.first.to_i if using Ruby and Unix), but then I'd check whether it actually does the job any better.

Andrew Grimm