ansaurus

Question

Delphi Search files and directories fastest alghorithm

Answer 1

A:

When I started to run into performance problems working with lots of small files on in the file system I moved to storing the files as blobs in database. There is no reason why related information like size, creation, and author couldn't also be stored in the database. Once the tables are populated in the database, I suspect that the database engine could do a much faster job of finding records (files) than any solution that we are going to come up with since Database code is highly specialized for efficient searches through large data sets. This will definitely be more flexible since adding a new search would be as simple as creating a new Select statement. Example: Select * from files where author = 'bob' and size > 10000

I'm not sure that approach will help you. Could you tell us more about what you are doing with these files and the search criteria.

William Leader 2010-06-16 15:15:57

Answer 2

+7 A:

I think any component that you'd buy, would also use findfirst/findnext. Recursively, of course. I don't think there's a way to look at every directory and file, without actually looking at every directory and file.

As a benchmark to see if your code is reasonably fast, compare performance against WinDirStat http://windirstat.info/ (Just to the point where it's gathered data, and is ready to build its graph of the space usage.)
Source code is available, if you want to see what they're doing. It's C, but I expect it's using the same API calls.

Chris Thornton 2010-06-16 15:55:28

yes, i'm afraid it is this way...thank you

Radu Barbu 2010-06-17 06:35:39

Maybe it's better to compare the performance against `RidNacs` from my website http://www.splashsoft.de. It's written in Delphi and uses findfirst/findnext and ... it's significantly faster than WinDirStat. :-)

splash 2010-06-23 15:19:13

Answer 3

+1 A:

If you want to get really fast search results consider using the Windows Search (API) or the Indexing service.

Other improvements might be to make use of threads and split the search for files and the gathering of file properties or just do a threaded search.

Remko 2010-06-16 17:52:00

thank you for this suggestion, but it's a little difficult to divide this recursive task into concurent threads...

Radu Barbu 2010-06-17 06:26:38

A multi-threaded search on a hard drive is going to be brutal for performance unless the volume being searched can handle the IOPS this will generate. Otherwise you're going to drive performance *down* because of disk thrashing instead of helping.

afrazier 2010-06-17 12:34:07

Answer 4

+1 A:

If you need to scan remote drive with that many files, I would strongly suggest doing so with a "client-server" design, so that the actual file scanning is always done locally and only the results are fetched remotely. That would save you a lot of time. Also, all "server" could scan in parallel.

Ken Bourassa 2010-06-16 21:09:17

probably will be something like this, using also RemObjects for sending the info to central server which will fill the database. thank you

Radu Barbu 2010-06-17 06:27:44

Answer 5

+1 A:

I once ran into a very similar problem where the number of files in the directory, coupled with findfirst/findnext was taking more time than was reasonable. With a few files its not an issue, but as you scale upwards into the thousands, or tens of thousands of files, then performance drops considerably.

Our solution was to use a queue file in a separate directory. As files are "added" to the system they were written to a queue file (was a fixed record file). When the system needed to process data, it would see if the file existed, and if so then rename it and open the renamed version (this way adds could occur for the next process pass). The file was then processed in order. We then archived the queue file & processed files into a subdirectory based on the date and time (for example: G:\PROCESSED\2010\06\25\1400 contained the files run at 2:00 pm on 6/25/2010).

Using this approach not only did we reach an almost "real-time" processing of files (delayed only by the frequency by which we processed the queue file), but we also insured processing of files in the order they were added.

skamradt 2010-06-16 23:36:56

this is the best solution...but, the program will not be installed like a task or like a service, to scan first time, and then hook the OS events in order to 'add' to queue the modifications. It is the best solution, but i am constrained by other 'specialized' people to do in some specific way. Solution that i found is: parse all the system for only the directories, make a temp file with this structure which will contain also a kind of indexes. after that, i will parse the structure, get all files for each element from the structure, and create another temporary file which will contain index

Radu Barbu 2010-06-17 06:32:53

from the first one and information needed. Thank you and best regards

Radu Barbu 2010-06-17 06:33:31

Answer 6

+1 A:

If your program is running on Windows 7 or Server 2008 R2, there are some enhancements to the Windows FindFirstFileEx function which will make it run a bit faster. You would have to copy and modify the VCL functions to incorporate the new options.

frogb 2010-06-16 23:41:16

i know that function but i'm constrained to use findfirst. thanks

Radu Barbu 2010-06-17 06:25:28

Answer 7

+1 A:

There isn't much room for optimization with a findfirst / findnext loop, because it's mostly I/O bound: the operating system needs to read this information from your HDD!

The proof: Make a small program that implements a simple findfirst / findnext loop that does NOTHING with the files it finds. Restart your computer and run it over your big directory, note the time it takes to finish. Then run it again, without restarting the computer. You'll notice the second run is significantly faster, because the operating system cached the information!

If you know for sure the directory you're trying to scan is heavily accessed by the OS because of some other application that's using the data (this would put the directory structure information into the OS's cache and make scanning not be bound to the I/O) you can try running several findfirst/findnext loops in parallel using threads. The down side of this is that if the directory structure is not allready in the OS cache your algorithm is again bound to HDD in/out and it might be worst then the original because you're now making multiple parallel I/O requests that need to be handled by the same device.

When I had to tackle this same problem I decided against parallel loops because the SECOND run of the application is allways so much faster, prooving I'm bound to I/O and no ammount of CPU optimisation would fix the I/O bottleneck.

Cosmin Prund 2010-06-17 05:36:45

Answer 8

+2 A:

mine is to use the muti-thread. first calculate the root folder size then assign each to a thread, let say 30 threads. then after the each thread finished you assgn it to some other directories whis is not scanned. loop until it fineshed. and off course you make an arrangement on your record files so that it will not scrambled. you can make it more threads, eats alot of cpu usage but you get faster results.

XBasic3000 2010-06-17 07:56:09

multi-threaded applications are tricky, application must be developed in a short time, and it is a real problem with synchronizing the threads.

Radu Barbu 2010-06-17 08:11:11

yes i agree with you. you can add WM_Message to keep track the thread. The hint are given already with K.Sandell just revise it. it will work for you.

XBasic3000 2010-06-17 08:49:51

Answer 9

+1 A:

I solved a similar problem by using two threads. This way I could "process" the file(s) at the same time as they where scanned from the disk. In my case the processing was significantly slower than scanning so I also had to limit the number of files in memory at one time.

TMyScanThread

Scan the file structure, for each "hit" add the path+file to a TList/TStringList or similar using Syncronize(). Remember to Sleep() inside the loop to let the OS have some time too.

PseudoCode for the thread:

TMyScanThread=class(TThread)
private
  fCount : Cardinal;
  fLastFile : String;
  procedure GetListCount;
  procedure AddToList;  
public
  FileList : TStringList;
  procedure Execute; Override;
end;

procedure TMyScanThread.GetListCount;
begin
  fCount := FileList.Count;
end;

procedure TMyScanThread.AddToList;
begin
  FileList.Add(fLastFile);
end;

procedure TMyScanThread.Execute;
begin
     try
        { Get the list size }
        Syncronize( GetListCount );
        if fCount<500 then
        begin
          // FindFirst code goes here
          { Add a file to the list }
          fLastFile := SR.Name; // Store Filename in local var
          Syncronize( AddToList ); // Call method to add to list
          SleepEx(0,True);
        end else
          SleepEx(1000,True);
     finally
        Terminate;
     end;
end;

TMyProcessFilesThread

Get the oldest entry in the list, and Process it. Then output results to DB.

This class is implemented similarly with Syncronized methods that access the list.

One alternative to the Syncronize() calls is to use a TCriticalSection. Implementing Syncronization between threads is often a matter of taste and the task at hand ...

K.Sandell 2010-06-17 08:20:51

Answer 10

+1 A:

You can also try BFS vs. DFS. This may affect your performance.

http://blogs.msdn.com/b/ericlippert/archive/2004/09/27/234826.aspx

http://en.wikipedia.org/wiki/Breadth-first_search

http://en.wikipedia.org/wiki/Depth-first_search

Alexander 2010-06-17 09:22:22

Answer 11

+1 A:

The one big thing you can do to really increase your performance is parse the MFT directly, if your volumes are NTFS. By doing this, you can enumerate files very, very quickly -- we're talking at least an order of magnitude faster. If all the metadata you need is part of the MFT record, your searches will complete much faster. Even if you have to do more reads for extra metadata, you'll be able to build up a list of candidate files very quickly.

The downside is that you'll have to parse the MFT yourself: There's no WinAPI functions for doing it that I'm aware of. You also get to worry about things that the shell normally does for you in worrying about things like hardlinks, junctions, reparse points, symlinks, shell links, etc.

However, if you want speed, the increase in complexity is the only way to achieve it.

I'm not aware of any available Delphi code that already implements an MFT parser, so you'll probably have to either use a 3rd party library or implement it yourself. I was going to suggest the Open Source (GPL) NTFS Undelete, which was written in Delphi, but it implements the MFT parsing via Python code and has a Delphi-Python bridge built in.

afrazier 2010-06-17 12:53:35

ansaurus

tags:

views:

answers:

Delphi Search files and directories fastest alghorithm

related questions