views: 611
answers: 5
When recursively traversing a directory structure, what is the most efficient algorithm to use if you have more files than directories? I notice that depth-first traversal seems to take longer when a directory contains many files. Does breadth-first traversal work more efficiently in this case? I have no way to profile the two algorithms at the moment, so your insights are very welcome.

EDIT: In response to alphazero's comment, I'm using PHP on a Linux machine.

+1  A: 

It makes sense that breadth-first would work better. When you enter your root folder, you create a list of items you need to deal with. Some of those items are files and some are directories.

If you use breadth-first, you deal with the files in the directory and forget about them before moving on to one of the child directories.

If you use depth-first, you keep growing a list of files to deal with later as you drill deeper down. This uses more memory to maintain the list of pending files, possibly causing more page faults, etc. Plus, you'd need to go through the list of new items anyway to figure out which ones are directories you can drill into, and then go through that same list (minus the directories) again when you finally get around to dealing with the files.

Igor
The description of your 'breadth-first' search is actually a pre-order depth-first search.
Pete Kirkham
A: 

Traverse the directory structure using BFS (as Igor mentioned).

When you reach a directory, start a thread to list all the files in it.

Kill the thread once it finishes listing/traversing the files.

So there will be a separate thread for each directory to list its files.

EXAMPLE:

root

  - d1
    - d1.1
    - d1.2
    - f1.1 ... f1.100

  - d2
    - d2.1
    - d2.2
    - d2.3
    - f2.1 ... f2.200

  - d3 
    ....

OUTPUT might look like this ->

 got d1
   started thread to get files of d1

 got d2
   started thread to get files of d2

   done with files in d1

 got d3
   started thread to get files of d3

 got d1.1
   started thread to get files of d1.1

 got d1.2
   started thread to get files of d1.2

So by the time you come back to traverse the depths of a directory, the thread fetching its files will (almost) have finished its job.

Hope this is helpful.

TheMachineCharmer
A: 

This would be effective on Windows (class DirectoryTreeReader below): it traverses with an explicit stack of directory listings, storing each directory's contents as it reads them.

static const uint64 DIRECTORY_INDICATOR = static_cast<uint64>(-1); // i.e. std::numeric_limits<uint64>::max()

class DirectoryContent {

public:
    DirectoryContent(const CString& path)
    : mIndex(static_cast<size_t>(-1)) // wraps to max; first Advance() moves to 0
    {
     CFileFind finder;
     // FindFile only reports whether a match exists; FindNextFileW must be
     // called before the current entry's data is valid.
     BOOL keepGoing = finder.FindFile(path + L"\\*.*");
     while (keepGoing) {
      keepGoing = finder.FindNextFileW();
      if (finder.IsDots()) {
       // Skip "." and ".."
      } else if (finder.IsDirectory()) {
       mPaths.push_back(finder.GetFilePath());
       mSizes.push_back(DIRECTORY_INDICATOR);
      } else {
       mPaths.push_back(finder.GetFilePath());
       mSizes.push_back(finder.GetLength());
      }
     }
    }

    bool OutOfRange() const {
     return mIndex >= mPaths.size();
    }
    void Advance() {
     ++mIndex;
    }
    bool IsDirectory() const {
     return mSizes[mIndex] == DIRECTORY_INDICATOR;
    }
    const CString& GetPath() const {
     return mPaths[mIndex];
    }
    uint64 GetSize() const {
     return mSizes[mIndex];
    }

private:
    CStrings mPaths;
    std::vector <uint64> mSizes;
    size_t mIndex;
};

class DirectoryTreeReader{
    // Non-copyable: declared private and left undefined
    DirectoryTreeReader& operator=(const DirectoryTreeReader& other);
    DirectoryTreeReader(const DirectoryTreeReader& other);

public:
    DirectoryTreeReader(const CString& startPath)
    : mStartPath(startPath){
     Reset();
    }

    void Reset() {
     // Argh!, no clear() in std::stack
     while(!mDirectoryContents.empty()) {
      mDirectoryContents.pop(); 
     }
     mDirectoryContents.push( DirectoryContent(mStartPath) );
     Advance();
    }
    void Advance() {
     bool keepGoing = true;
     while(keepGoing) {
      if (mDirectoryContents.empty()){
       return;
      }
      mDirectoryContents.top().Advance();
      if (mDirectoryContents.top().OutOfRange()){
       mDirectoryContents.pop();
      } else if ( mDirectoryContents.top().IsDirectory() ){
       mDirectoryContents.push( DirectoryContent(mDirectoryContents.top().GetPath()) );
      } else {
       keepGoing = false;
      }
     }
    }
    bool OutOfRange() const {
     return mDirectoryContents.empty();
    }
    const CString& GetPath() const {
     return mDirectoryContents.top().GetPath();
    }
    uint64 GetSize() const {
     return mDirectoryContents.top().GetSize();
    }

private:
    const CString mStartPath;
    std::stack <DirectoryContent> mDirectoryContents;
};
Viktor Sehr
+2  A: 

Since you have more files than directories, it does not appear that you are dealing with very deeply nested directories that would cause DFS to take more memory (and hence somewhat more time) than BFS. Essentially, BFS and DFS both do the same thing (visit every node of the graph), so in general their speeds should not differ by any significant amount.

It is difficult to say exactly why your DFS is slower without seeing your implementation. Are you sure you are not visiting the same nodes more than once because of links/shortcuts in your filesystem? Also, you will probably get a significant speedup if you use an explicit stack-based DFS rather than recursion.

MAK
+1  A: 

You probably only want to scan each directory's contents once, so the processing order (whether you process a directory's contents before or after visiting other directories) probably matters more than whether you do a depth-first or breadth-first search. Depending on your file system, it may also be more efficient to process file entries as soon as you encounter them rather than stat-ing them first to see whether they are files or directories. So I'd suggest a pre-order depth-first search as a starting point, as it is the easiest to implement and the most likely to have good cache/seek performance.

In summary, pre-order depth-first: on entering a directory, list its contents, process any files in that directory, and save a list of child directory names. Then enter each child directory in turn. Just use the program's call stack as the stack, unless you know you have vastly deep directory structures.

Pete Kirkham