I have a few situations where I need to list files recursively, but my implementations have been slow. I have a directory structure with 92784 files. find lists the files in less than 0.5 seconds, but my Haskell implementation is a lot slower.

My first implementation took a bit over 9 seconds to complete, the next version a bit over 5 seconds, and I'm currently down to a bit under 2 seconds.

import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

listFilesR :: FilePath -> IO [FilePath]
listFilesR path =
  let
    -- Skip the "." and ".." entries.
    isDODD "." = False
    isDODD ".." = False
    isDODD _ = True
  in do
    allfiles <- getDirectoryContents path
    dirs <- forM allfiles $ \d ->
      if isDODD d
        then do
          let p = path </> d
          isDir <- doesDirectoryExist p
          if isDir then listFilesR p else return [d]
        else return []
    return $ concat dirs

The test takes about 100 megabytes of memory (measured with +RTS -s), and the program spends around 40% of its time in GC.

I was thinking of doing the listing in a WriterT monad, with Seq as the monoid, to avoid the concats and intermediate list creation. Is it likely that this would help? What else should I do?
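
For concreteness, the WriterT-over-Seq version I have in mind would look roughly like this (an untested sketch; listFilesW is just an illustrative name):

import Control.Monad (forM_)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Writer.Strict (WriterT, execWriterT, tell)
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

-- Untested sketch: accumulate results in a Seq via WriterT instead of
-- building and concatenating intermediate lists.
listFilesW :: FilePath -> IO [FilePath]
listFilesW root = fmap toList (execWriterT (go root))
  where
    go :: FilePath -> WriterT (Seq.Seq FilePath) IO ()
    go path = do
      allfiles <- lift (getDirectoryContents path)
      forM_ (filter (`notElem` [".", ".."]) allfiles) $ \d -> do
        let p = path </> d
        isDir <- lift (doesDirectoryExist p)
        if isDir then go p else tell (Seq.singleton d)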

Edit: I have since rewritten the function to use readDirStream, which helps keep memory usage down. There is still some allocation happening, but the productivity rate is now above 95% and it runs in under a second.

This is the current version:

import System.Directory (doesDirectoryExist)
import System.FilePath ((</>))
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)

list :: FilePath -> IO ()
list path = do
  de <- openDirStream path
  readDirStream de >>= go de
  closeDirStream de
  where
    go _ ""   = return ()                 -- readDirStream yields "" at end of stream
    go d "."  = readDirStream d >>= go d
    go d ".." = readDirStream d >>= go d
    go d x = do
      let newpath = path </> x
      e <- doesDirectoryExist newpath
      if e
        then list newpath >> readDirStream d >>= go d
        else putStrLn newpath >> readDirStream d >>= go d
+1  A: 

One problem is that it has to construct the entire list of directory contents before the program can do anything with it. Lazy I/O is usually frowned upon, but using unsafeInterleaveIO here cut memory use significantly.

import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))
import System.IO.Unsafe (unsafeInterleaveIO)

listFilesR :: FilePath -> IO [FilePath]
listFilesR path =
  let
    isDODD "." = False
    isDODD ".." = False
    isDODD _ = True
  in unsafeInterleaveIO $ do
    allfiles <- getDirectoryContents path
    dirs <- forM allfiles $ \d ->
      if isDODD d
        then do
          let p = path </> d
          isDir <- doesDirectoryExist p
          if isDir then listFilesR p else return [d]
        else return []
    return $ concat dirs
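
Because the recursion is wrapped in unsafeInterleaveIO, the result list is produced on demand, so a consumer can start working before the whole tree has been traversed. A minimal (hypothetical) usage:

main :: IO ()
main = listFilesR "." >>= mapM_ putStrLn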
sabauma
That shaved off about 0.4 seconds and 20 megabytes, so it's a little bit better.
Masse
+4  A: 

I think that System.Directory.getDirectoryContents constructs the whole list before returning it, and therefore uses a lot of memory. How about using System.Posix.Directory? System.Posix.Directory.readDirStream returns entries one at a time.

Also, the FileManip library might be useful, although I have never used it.

Tsuyoshi Ito
I made a version using System.Posix.Directory and iteratees; it didn't do much, if any, better. One odd thing I found is that System.Posix.Directory doesn't seem to provide the functionality I'd expect. readdir returns a pointer to a struct dirent, but it seems the only thing you can get out of a DirStream is the filename, which means you have to make another call (presumably to stat() via doesDirectoryExist) to find out whether an entry is a directory. That could be part of the problem as well: find doesn't need to make an extra syscall to discover whether an entry is a directory.
mokus
@mokus: Thanks for the info. On POSIX systems, reading a directory with [readdir](http://www.opengroup.org/onlinepubs/009695399/functions/readdir.html) does not tell you whether the returned entry is a directory, so you need a separate syscall (usually stat or lstat) to decide. The behavior of System.Posix.Directory you describe is therefore not odd. Some implementations of the find command use the hardlink-counting trick to omit unnecessary calls to stat, which makes the traversal faster.
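
A sketch of that trick (illustrative only, not taken from any actual find implementation):

import System.FilePath ((</>))
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)
import System.Posix.Files (getSymbolicLinkStatus, isDirectory, linkCount)

-- On traditional Unix filesystems a directory's link count is
-- 2 + the number of subdirectories ("." plus the parent's entry, plus one
-- ".." per child). Once that many subdirectories have been found, every
-- remaining entry must be a file and needs no stat call. Filesystems that
-- don't maintain this invariant break the trick.
walk :: FilePath -> IO ()
walk path = do
  st <- getSymbolicLinkStatus path
  let subdirsLeft = fromIntegral (linkCount st) - 2 :: Int
  ds <- openDirStream path
  let go n = do
        entry <- readDirStream ds
        case entry of
          ""   -> return ()                    -- end of stream
          "."  -> go n
          ".." -> go n
          name -> do
            let p = path </> name
            if n > 0
              then do
                st' <- getSymbolicLinkStatus p
                if isDirectory st'
                  then walk p >> go (n - 1)
                  else putStrLn p >> go n
              else putStrLn p >> go n          -- no subdirs left: skip the stat
  go subdirsLeft
  closeDirStream ds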
Tsuyoshi Ito
@Tsuyoshi Ito: On my system (Mac OS), struct dirent has a field d_type, one possible value of which is DT_DIR. Wikipedia hints that this field is optional in the POSIX spec, but it would still be a strong case for DirStream to provide an isDir or fileType operation that uses the field when available and falls back to stat otherwise. Even if it's not required by the standard, if his platform has it, I'd be shocked if find isn't using it.
mokus
+2  A: 

Profiling your code shows that most of the CPU time goes to getDirectoryContents, doesDirectoryExist and (</>). This means that changing the data structure alone won't help much. If you want to match the performance of find, you should use lower-level functions for accessing the filesystem, probably the ones Tsuyoshi pointed out.
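
(For reference, a typical way to get such a profile with a recent GHC, assuming the program lives in Main.hs; older GHCs spell the first profiling flag -auto-all:)

ghc -O2 -prof -fprof-auto -rtsopts Main.hs
./Main +RTS -p -RTS    # writes the cost-centre report to Main.prof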

Daniel Velkov
+1  A: 

Would it be an option to use some sort of cache combined with the read? I was thinking of an asynchronous indexing service/thread that keeps the cache up to date in the background. Perhaps you could implement the cache as a simple SQL database, which would give you decent performance when running queries against it.

Can you elaborate on your project/idea so we can come up with alternatives?

I wouldn't go for a "full index" myself, as I mostly build web-based services and response time is critical to me. On the other hand, if it's part of the initial setup of a new server, I'm sure the customers wouldn't mind waiting that first time. I would just store the result in the DB for later lookups.

BerggreenDK
I'm always open to new ideas. I'm writing a wrapper for HyperEstraier, a full text search engine, for desktop use. I'm a heavy command line user, so I was thinking of doing a native gatherer and searcher. At the moment I have converted my bash script to Haskell, but it still uses the estcmd commands for gathering and searching, and the system process calls are ugly. For the native gatherer I need to parse every file at least on the first pass, but I can't think of a way to list only the files that have been added or modified since the last time.
Masse
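
(An aside on the "list only changed files" problem: one common approach is to record a timestamp for each indexing run and compare it against each file's mtime on the next pass. A rough sketch, with modifiedSince as an illustrative name; deletions and renames would still need separate handling:)

import System.Posix.Files (getFileStatus, modificationTime)
import System.Posix.Types (EpochTime)

-- Rough sketch: keep a file only if its mtime is newer than the previous
-- indexing run. lastRun would be loaded from wherever the previous run
-- recorded it (e.g. the SQL cache suggested above).
modifiedSince :: EpochTime -> FilePath -> IO Bool
modifiedSince lastRun p = do
  st <- getFileStatus p
  return (modificationTime st > lastRun)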
OK, what kind of OS are you building for? E.g. Windows has "directory events" for new files, renames, etc. If you have some sort of "root" folder, you might be able to attach a "root event handler" with recursive triggering. I haven't tried it myself, but I would look in that direction after indexing the catalog the first time.
BerggreenDK
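
(If the target platform is Linux, the analogue of those directory events is inotify. A rough sketch using the hinotify package; note that the exact addWatch signature differs between hinotify versions, with newer releases taking the path as a ByteString rather than a FilePath:)

import System.INotify (EventVariety (..), addWatch, initINotify, killINotify)

-- Rough sketch, assuming an older hinotify where addWatch takes a FilePath.
-- Watches are not recursive: each subdirectory needs its own watch.
main :: IO ()
main = do
  inotify <- initINotify
  _ <- addWatch inotify [Create, Modify, Delete, MoveIn, MoveOut]
                "/path/to/root"   -- illustrative path
                print             -- just dump each event to stdout
  putStrLn "Watching; press enter to stop."
  _ <- getLine
  killINotify inotify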