views: 528 · answers: 6

I have a list of file names, and I want to search a directory and all its subdirectories for each of them. These directories contain about 200,000 files each. My code finds the file, but it takes about 20 minutes per file. Can someone suggest a better method?

Code Snippet

String[] file_names = File.ReadAllLines(@"C:\file.txt");
foreach(string file_name in file_names) 
{
    string[] files = Directory.GetFiles(@"I:\pax\", file_name + ".txt",
                                        SearchOption.AllDirectories);
    foreach(string file in files)
    {
        System.IO.File.Copy(file, 
                            @"C:\" + 
                            textBox1.Text + @"\N\O\" + 
                            file_name + 
                            ".txt"
                            );
    }

}
+10  A: 

If you're searching for multiple files in the same directory structure, you should find all the files in that directory structure once, and then search through them in memory. There's no need to go to the file system again and again.

EDIT: There's an elegant way of doing this, with LINQ - and the less elegant way, without. Here's the LINQ way:

using System;
using System.IO;
using System.Linq;

class Test
{
    static void Main()
    {
        // This creates a lookup from filename to the set of 
        // directories containing that file
        var textFiles = 
            Directory.GetFiles("I:\\pax", "*.txt", SearchOption.AllDirectories)
                     .ToLookup(file => Path.GetFileName(file),
                               file => Path.GetDirectoryName(file));

        string[] fileNames = File.ReadAllLines(@"c:\file.txt");
        // Remove the quotes for your real code :)
        string targetDirectory = "C:\\" + "textBox1.Text" + "\\N\\O\\";

        foreach (string fileName in fileNames)
        {
            string tmp = fileName + ".txt";
            foreach (string directory in textFiles[tmp])
            {
                string source = Path.Combine(directory, tmp);
                string target = Path.Combine(targetDirectory, tmp);
                File.Copy(source, target);                                       
            }
        }
    }
}

Let me know if you need the non-LINQ way. One thing to check before I do so though - this could copy multiple files over the top of each other. Is that really what you want to do? (Imagine that a.txt exists in multiple places, and "a" is in the file.)
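
For completeness, a rough sketch of the non-LINQ version, building the same lookup by hand with a Dictionary<string, List<string>>, might look like the following. This sketch is not from the original answer, and the `C:\target` path is just a placeholder for the real target directory:

using System;
using System.Collections.Generic;
using System.IO;

class TestNonLinq
{
    static void Main()
    {
        // Build the same lookup by hand: file name -> directories containing it.
        var textFiles = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (string file in Directory.GetFiles(@"I:\pax", "*.txt", SearchOption.AllDirectories))
        {
            string name = Path.GetFileName(file);
            List<string> dirs;
            if (!textFiles.TryGetValue(name, out dirs))
            {
                dirs = new List<string>();
                textFiles[name] = dirs;
            }
            dirs.Add(Path.GetDirectoryName(file));
        }

        string[] fileNames = File.ReadAllLines(@"c:\file.txt");
        // C:\target is a placeholder; use textBox1.Text (or similar) in the real code.
        string targetDirectory = @"C:\target\N\O\";

        foreach (string fileName in fileNames)
        {
            string tmp = fileName + ".txt";
            List<string> directories;
            if (!textFiles.TryGetValue(tmp, out directories))
            {
                continue; // no directory contains this name
            }
            foreach (string directory in directories)
            {
                File.Copy(Path.Combine(directory, tmp), Path.Combine(targetDirectory, tmp));
            }
        }
    }
}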

Jon Skeet
Wow. Now I see why Jon Skeet gets all the points - he answers the fastest!
codekaizen
I'd have to see some example code for what you mean; I'm too new to understand the structure from pure text.
Mike
Also, my list of files might be around 2,000 files long.
Mike
Good answer as always Jon. Do you think for tasks like these scripting languages do a better job? Know some people who use perl/python for tasks like these.
Perpetualcoder
We have a structure in place for Perl; I just want to put together a process that will encapsulate a GUI.
Mike
@Perpetualcoder: No, not really - in a scripting language you'd still have to get all the files into an appropriate data structure to start with. There's a bit of verbosity here in terms of declaring a class etc, but that's all.
Jon Skeet
Thanks Jon, I tried out the code you listed; I'm sitting at 16 minutes with no files being copied yet.
Mike
@Mike: That would suggest that it hasn't even finished finding all the files yet. Just how many files do you have under `I:\pax`?
Jon Skeet
Ohhh, right around 2.9 mil...
Mike
A case for .NET 4.0 Parallel Extensions already??
Perpetualcoder
@Mike: It would have been worth telling us that to start with. There's little you can do to speed that up; it *will* take a long time for the file system to report that many filenames. @Perpetualcoder: Absolutely not; there's a single bottleneck here: the file system. Using multiple threads would make it worse, not better.
Jon Skeet
@Jon - yours works. I would suggest to anybody reading this post, though, that the method was not a dramatic increase in speed. Jon, I also opened a new thread with an idea to fix this issue; if you get a moment, please take a look. Thanks. http://stackoverflow.com/questions/1921781/c-massive-search-and-copy-part2
Mike
@Mike: It wouldn't make the first file copy any faster, but it should be very fast after that. The file system will have some caching, but with 2.9 million entries it may well not cache everything.
Jon Skeet
@Jon, heh it actually told me a server with 32 gigs of memory ran out of memory, heh
Mike
+2  A: 

You're probably better off trying to load all the file paths into memory. Call Directory.GetFiles() once, and put the results into a HashSet<String>. Then do lookups on the HashSet. This will work fine if you have enough memory. It would be easy to try.

If you run out of memory, you'll have to be smarter, like by using a buffer cache. The easiest way to do this is to load all the file paths as rows into a database table, and have the query processor do the work of managing the buffer cache for you.

Here's code for the first approach:

String[] file_names = File.ReadAllLines(@"C:\file.txt");
HashSet<string> allFiles = new HashSet<string>();
// Enumerate the whole tree once, up front
string[] files = Directory.GetFiles(@"I:\pax\", "*.txt", SearchOption.AllDirectories);
foreach (string file in files)
{
    allFiles.Add(file);
}

foreach(string file_name in file_names)
{
    String file = allFiles.FirstOrDefault(f => Path.GetFileName(f) == file_name + ".txt");
    if (file != null)
    {
        System.IO.File.Copy(file, @"C:\" + textBox1.Text + @"\N\O\" + file_name + ".txt");
    }
}
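
For the database fallback mentioned above (letting the query processor manage the buffer cache), a rough sketch might look like the following. The use of SQLite via the Microsoft.Data.Sqlite package is an assumption for illustration, not something from the original answer; any database would do:

using System;
using System.IO;
using Microsoft.Data.Sqlite;   // assumed package; any database would do

class IndexInDatabase
{
    static void Main()
    {
        using (var conn = new SqliteConnection("Data Source=files.db"))
        {
            conn.Open();

            var create = conn.CreateCommand();
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS files(name TEXT, path TEXT); " +
                "CREATE INDEX IF NOT EXISTS idx_name ON files(name);";
            create.ExecuteNonQuery();

            // Load every path once; the database pages data to disk as needed,
            // so the whole list never has to sit in memory.
            using (var tx = conn.BeginTransaction())
            {
                var insert = conn.CreateCommand();
                insert.Transaction = tx;
                insert.CommandText = "INSERT INTO files(name, path) VALUES ($name, $path)";
                var pName = insert.Parameters.AddWithValue("$name", "");
                var pPath = insert.Parameters.AddWithValue("$path", "");
                foreach (string file in Directory.EnumerateFiles(@"I:\pax", "*.txt",
                                                                 SearchOption.AllDirectories))
                {
                    pName.Value = Path.GetFileName(file);
                    pPath.Value = file;
                    insert.ExecuteNonQuery();
                }
                tx.Commit();
            }

            // Look up one name; repeat for each entry from file.txt.
            var query = conn.CreateCommand();
            query.CommandText = "SELECT path FROM files WHERE name = $name";
            query.Parameters.AddWithValue("$name", "example.txt");   // hypothetical name
            using (var reader = query.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}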

You could be even smarter on memory usage by traversing the directories one at a time and adding each resulting file array to the hash set as you go. That way all the filenames wouldn't have to exist in one big String[] at once.
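
A sketch of that one-directory-at-a-time variant (again, not part of the original answer) could look like this:

using System;
using System.Collections.Generic;
using System.IO;

class PerDirectoryScan
{
    static HashSet<string> CollectFiles(string root)
    {
        var allFiles = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            // Without SearchOption.AllDirectories this returns only the current
            // directory's files, so no single multi-million-entry array is built.
            foreach (string file in Directory.GetFiles(dir, "*.txt"))
            {
                allFiles.Add(file);
            }
            foreach (string sub in Directory.GetDirectories(dir))
            {
                pending.Push(sub);
            }
        }
        return allFiles;
    }
}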

codekaizen
I tried it; it still takes about 20 minutes.
Mike
Per file!? I find that very hard to believe... Are you sure you moved the "Directory.GetFiles()" call _out_ of the loop?
codekaizen
A: 

At a glance it appears that there are .NET APIs to call the Windows Indexing service... provided the machine you're using has indexing enabled (and I'm also unsure if the aforementioned service refers to the XP-era Indexing Service or the Windows Search indexing service).

Google Search

One possible lead

Another

STW
A: 

Try using LINQ to query the filesystem. Not 100% sure of performance but it is really easy to test.

var filesResult = from file in new DirectoryInfo(path).GetFiles("*.txt", SearchOption.AllDirectories)
                  where file.Name == filename
                  select file;

Then just do whatever you want with the result.
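
For example, with a hypothetical targetDir, copying the matches afterwards might look like this:

// filesResult is the sequence of FileInfo objects produced by the query above.
string targetDir = @"C:\output";   // hypothetical destination directory
foreach (FileInfo file in filesResult)
{
    file.CopyTo(Path.Combine(targetDir, file.Name));
}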

robber.baron
A: 

Scanning a directory structure is an I/O-intensive operation. Whatever you do, the first GetFiles() call will take the majority of the time; by the end of that first call most of the file information will probably be in the file system cache, and a second call will return in almost no time compared to the first (depending on your free memory and the file system cache size).

Probably your best option is to turn on indexing on the file system and use it somehow; see Querying the Index Programmatically.
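
As a rough illustration (assuming the Windows Search OLE DB provider is available and the drive is indexed; none of this code is from the original answer), the index can be queried through System.Data.OleDb:

using System;
using System.Data.OleDb;

class QueryWindowsSearch
{
    static void Main()
    {
        // Standard connection string for the Windows Search index.
        const string connectionString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(
            "SELECT System.ItemPathDisplay FROM SYSTEMINDEX " +
            "WHERE System.FileName = 'example.txt' " +        // hypothetical file name
            "AND SCOPE = 'file:I:/pax'", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));   // full path of each match
                }
            }
        }
    }
}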

Sinan Komut
A: 

The Linq answer may run into problems, because it loads all the file names into memory before it starts selecting from them. Generally, you might want to load the contents of a single directory at a time, to reduce memory pressure.

However, for a problem like this, you might want to go up one level in the problem formulation. If this is a query you run often, you could build something that uses a FileSystemWatcher to listen for changes in the top directory and all directories below it. Prime it on start-up by walking all the directories and building them into a Dictionary<> or HashSet<>. (Yes, this has the same memory problem as the Linq solution.) Then, when you get file add/delete/rename notifications, update the dictionary. That way, each individual query can be answered very quickly.

If this is queries from a tool that's invoked a lot, you probably want to build the FileSystemWatcher into a service, and connect to / query that service from the actual tool that needs to know, so that the file system information can be built up once, and re-used for the lifetime of the service process.
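
A minimal sketch of that watcher-plus-dictionary idea (the root path and class name here are assumptions, and a real service would need more robust error handling) might be:

using System;
using System.Collections.Generic;
using System.IO;

class FileIndex
{
    // File name -> directories that contain it. Guarded by a lock because
    // FileSystemWatcher raises its events on thread-pool threads.
    private readonly Dictionary<string, List<string>> index =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    private readonly object gate = new object();
    private readonly FileSystemWatcher watcher;   // kept alive for the lifetime of the index

    public FileIndex(string root)
    {
        // Prime the index once at start-up.
        foreach (string file in Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories))
        {
            Add(file);
        }

        watcher = new FileSystemWatcher(root, "*.txt") { IncludeSubdirectories = true };
        watcher.Created += (s, e) => { lock (gate) { Add(e.FullPath); } };
        watcher.Deleted += (s, e) => { lock (gate) { Remove(e.FullPath); } };
        watcher.Renamed += (s, e) => { lock (gate) { Remove(e.OldFullPath); Add(e.FullPath); } };
        watcher.EnableRaisingEvents = true;
    }

    // Answers "which directories contain this file?" without touching the disk.
    public List<string> Find(string fileName)
    {
        lock (gate)
        {
            List<string> dirs;
            return index.TryGetValue(fileName, out dirs)
                ? new List<string>(dirs)
                : new List<string>();
        }
    }

    private void Add(string path)
    {
        string name = Path.GetFileName(path);
        List<string> dirs;
        if (!index.TryGetValue(name, out dirs))
        {
            dirs = new List<string>();
            index[name] = dirs;
        }
        dirs.Add(Path.GetDirectoryName(path));
    }

    private void Remove(string path)
    {
        string name = Path.GetFileName(path);
        List<string> dirs;
        if (index.TryGetValue(name, out dirs))
        {
            dirs.Remove(Path.GetDirectoryName(path));
        }
    }
}

Once the initial walk has completed, a call such as new FileIndex(@"I:\pax").Find("a.txt") costs only a dictionary lookup.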

Jon Watte
Oh, and Windows Indexing may already be able to do that for you, except it's not guaranteed to be an in-core index (and in fact, it isn't). Another way of speeding it up is to move to SSDs. Really, spinning magnetic media is rapidly going the way of the dinosaurs.
Jon Watte