I have a few situations where I need to list files recursively, but my implementations have been slow. I have a directory structure with 92784 files. find lists the files in less than 0.5 seconds, but my Haskell implementation is a lot slower.

My first implementation took a bit over 9 seconds to complete, the next version a bit over 5 seconds, and I'm currently down to a bit under 2 seconds.

import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

listFilesR :: FilePath -> IO [FilePath]
listFilesR path =
  let
    -- Skip the "." and ".." entries.
    isDODD "." = False
    isDODD ".." = False
    isDODD _ = True
  in do
    allfiles <- getDirectoryContents path
    dirs <- forM allfiles $ \d ->
      if isDODD d
        then do
          let p = path </> d
          isDir <- doesDirectoryExist p
          if isDir then listFilesR p else return [d]
        else return []
    return $ concat dirs

The test takes about 100 megabytes of memory (measured with +RTS -s), and the program spends around 40% of its time in GC.

I was thinking of doing the listing in a WriterT monad, with Seq as the monoid, to avoid the concats and intermediate list creation. Is it likely that this would help? What else should I do?
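
For concreteness, the WriterT-over-Seq version I have in mind would look roughly like this (an untested sketch; listFilesW is just an illustrative name):

import Control.Monad (forM_)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Writer.Strict (WriterT, execWriterT, tell)
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

-- Untested sketch: accumulate results in a Seq via WriterT instead of
-- building and concatenating intermediate lists.
listFilesW :: FilePath -> IO [FilePath]
listFilesW root = fmap toList (execWriterT (go root))
  where
    go :: FilePath -> WriterT (Seq.Seq FilePath) IO ()
    go path = do
      allfiles <- lift (getDirectoryContents path)
      forM_ (filter (`notElem` [".", ".."]) allfiles) $ \d -> do
        let p = path </> d
        isDir <- lift (doesDirectoryExist p)
        if isDir then go p else tell (Seq.singleton d)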

Edit: I have since rewritten the function to use readDirStream, which helps keep memory usage down. There is still some allocation happening, but the productivity rate is now above 95% and it runs in under a second.

This is the current version:

import System.Directory (doesDirectoryExist)
import System.FilePath ((</>))
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)

list :: FilePath -> IO ()
list path = do
  de <- openDirStream path
  readDirStream de >>= go de
  closeDirStream de
  where
    go _ ""   = return ()                 -- readDirStream yields "" at end of stream
    go d "."  = readDirStream d >>= go d
    go d ".." = readDirStream d >>= go d
    go d x = do
      let newpath = path </> x
      e <- doesDirectoryExist newpath
      if e
        then list newpath >> readDirStream d >>= go d
        else putStrLn newpath >> readDirStream d >>= go d
+1  A: 

One problem is that it has to construct the entire list of directory contents before the program can do anything with it. Lazy I/O is usually frowned upon, but using unsafeInterleaveIO here cut memory use significantly.

import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))
import System.IO.Unsafe (unsafeInterleaveIO)

listFilesR :: FilePath -> IO [FilePath]
listFilesR path =
  let
    isDODD "." = False
    isDODD ".." = False
    isDODD _ = True
  in unsafeInterleaveIO $ do
    allfiles <- getDirectoryContents path
    dirs <- forM allfiles $ \d ->
      if isDODD d
        then do
          let p = path </> d
          isDir <- doesDirectoryExist p
          if isDir then listFilesR p else return [d]
        else return []
    return $ concat dirs
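
Because the recursion is wrapped in unsafeInterleaveIO, the result list is produced on demand, so a consumer can start working before the whole tree has been traversed. A minimal (hypothetical) usage:

main :: IO ()
main = listFilesR "." >>= mapM_ putStrLn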
sabauma
That shaved off about 0.4 seconds and 20 megabytes, so it's a little bit better.
Masse
+4  A: 

I think that System.Directory.getDirectoryContents constructs the whole list before returning it, and therefore uses a lot of memory. How about using System.Posix.Directory? System.Posix.Directory.readDirStream returns entries one at a time.

Also, the FileManip library might be useful, although I have never used it.

Tsuyoshi Ito
I made a version using System.Posix.Directory and iteratees; it didn't do much, if any, better. One odd thing I found is that System.Posix.Directory doesn't seem to provide the functionality I'd expect. readdir returns a pointer to a struct dirent, but it seems the only thing you can get out of a DirStream is the filename, which means you have to make another call (presumably to stat() via doesDirectoryExist) to find out whether an entry is a directory. That could be part of the problem as well: find doesn't need to make an extra syscall to discover whether an entry is a directory.
mokus
@mokus: Thanks for the info. On POSIX systems, reading a directory with [readdir](http://www.opengroup.org/onlinepubs/009695399/functions/readdir.html) does not tell you whether the returned entry is a directory, so you need a separate syscall (usually stat or lstat) to decide. The behavior of System.Posix.Directory you describe is therefore not odd. Some implementations of the find command use the hardlink-counting trick to omit unnecessary calls to stat, which makes the traversal faster.
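
A sketch of that trick (illustrative only, not taken from any actual find implementation):

import System.FilePath ((</>))
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)
import System.Posix.Files (getSymbolicLinkStatus, isDirectory, linkCount)

-- On traditional Unix filesystems a directory's link count is
-- 2 + the number of subdirectories ("." plus the parent's entry, plus one
-- ".." per child). Once that many subdirectories have been found, every
-- remaining entry must be a file and needs no stat call. Filesystems that
-- don't maintain this invariant break the trick.
walk :: FilePath -> IO ()
walk path = do
  st <- getSymbolicLinkStatus path
  let subdirsLeft = fromIntegral (linkCount st) - 2 :: Int
  ds <- openDirStream path
  let go n = do
        entry <- readDirStream ds
        case entry of
          ""   -> return ()                    -- end of stream
          "."  -> go n
          ".." -> go n
          name -> do
            let p = path </> name
            if n > 0
              then do
                st' <- getSymbolicLinkStatus p
                if isDirectory st'
                  then walk p >> go (n - 1)
                  else putStrLn p >> go n
              else putStrLn p >> go n          -- no subdirs left: skip the stat
  go subdirsLeft
  closeDirStream ds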
Tsuyoshi Ito
@Tsuyoshi Ito: On my system (Mac OS), struct dirent has a field d_type, one possible value of which is DT_DIR. Wikipedia hints that this field is optional in the POSIX spec, but it would still be a strong case for DirStream to provide an isDir or fileType operation that uses the field when available and falls back to stat otherwise. Even if it's not required by the standard, if his platform has it, I'd be shocked if find isn't using it.
mokus
+2  A: 

Profiling your code shows that most of the CPU time goes to getDirectoryContents, doesDirectoryExist and (</>). This means that changing the data structure alone won't help much. If you want to match the performance of find, you should use lower-level functions for accessing the filesystem, probably the ones Tsuyoshi pointed out.
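
(For reference, a typical way to get such a profile with a recent GHC, assuming the program lives in Main.hs; older GHCs spell the first profiling flag -auto-all:)

ghc -O2 -prof -fprof-auto -rtsopts Main.hs
./Main +RTS -p -RTS    # writes the cost-centre report to Main.prof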

Daniel Velkov
+1  A: 

Would it be an option to use some sort of cache combined with the read? I was thinking of an asynchronous indexing service/thread that keeps the cache up to date in the background. Perhaps you could implement the cache as a simple SQL database, which would give you decent performance when running queries against it.

Can you elaborate on your project/idea so we can come up with alternatives?

I wouldn't go for a "full index" myself, as I mostly build web-based services and response time is critical to me. On the other hand, if it's part of the initial setup of a new server, I'm sure the customers wouldn't mind waiting that first time. I would just store the result in the DB for later lookups.

BerggreenDK
I'm always open to new ideas. I'm writing a wrapper for HyperEstraier, a full text search engine, for desktop use. I'm a heavy command line user, so I was thinking of doing a native gatherer and searcher. At the moment I have converted my bash script to Haskell, but it still uses the estcmd commands for gathering and searching, and the system process calls are ugly. For the native gatherer I need to parse every file at least on the first pass, but I can't think of a way to list only the files that have been added or modified since the last time.
Masse
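
(An aside on the "list only changed files" problem: one common approach is to record a timestamp for each indexing run and compare it against each file's mtime on the next pass. A rough sketch, with modifiedSince as an illustrative name; deletions and renames would still need separate handling:)

import System.Posix.Files (getFileStatus, modificationTime)
import System.Posix.Types (EpochTime)

-- Rough sketch: keep a file only if its mtime is newer than the previous
-- indexing run. lastRun would be loaded from wherever the previous run
-- recorded it (e.g. the SQL cache suggested above).
modifiedSince :: EpochTime -> FilePath -> IO Bool
modifiedSince lastRun p = do
  st <- getFileStatus p
  return (modificationTime st > lastRun)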
OK, what kind of OS are you building for? E.g. Windows has "directory events" for new files, renames, etc. If you have some sort of "root" folder, you might be able to attach a "root event handler" with recursive triggering. I haven't tried it myself, but I would look in that direction after indexing the catalog the first time.
BerggreenDK
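
(If the target platform is Linux, the analogue of those directory events is inotify. A rough sketch using the hinotify package; note that the exact addWatch signature differs between hinotify versions, with newer releases taking the path as a ByteString rather than a FilePath:)

import System.INotify (EventVariety (..), addWatch, initINotify, killINotify)

-- Rough sketch, assuming an older hinotify where addWatch takes a FilePath.
-- Watches are not recursive: each subdirectory needs its own watch.
main :: IO ()
main = do
  inotify <- initINotify
  _ <- addWatch inotify [Create, Modify, Delete, MoveIn, MoveOut]
                "/path/to/root"   -- illustrative path
                print             -- just dump each event to stdout
  putStrLn "Watching; press enter to stop."
  _ <- getLine
  killINotify inotify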