I have a few situations where I need to list files recursively, but my implementations have been slow. I have a directory structure with 92784 files. find
lists the files in less than 0.5 seconds, but my Haskell implementation is a lot slower.
My first implementation took a bit over 9 seconds to complete, the next version a bit over 5 seconds, and I'm currently down to a bit under 2 seconds.
import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

listFilesR :: FilePath -> IO [FilePath]
listFilesR path = let
    -- skip the "." and ".." entries
    isDODD "." = False
    isDODD ".." = False
    isDODD _ = True
  in do
    allfiles <- getDirectoryContents path
    dirs <- forM allfiles $ \d ->
      if isDODD d then
        do let p = path </> d
           isDir <- doesDirectoryExist p
           -- recurse into directories, otherwise return the full path
           if isDir then listFilesR p else return [p]
        else return []
    return $ concat dirs
The test takes about 100 megabytes of memory (+RTS -s), and the program spends around 40% in GC.
I was thinking of doing the listing in a WriterT monad with a Seq (from Data.Sequence) as the monoid, to avoid the repeated concats and intermediate list creation; roughly the sketch below. Is that likely to help? What else should I do?
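This is an untested sketch of what I mean (listFilesW is just a name I made up for the illustration):

import Control.Monad (forM_)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Writer.Strict (WriterT, execWriterT, tell)
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

-- Accumulate the results in a Seq via WriterT instead of concatenating lists.
listFilesW :: FilePath -> IO [FilePath]
listFilesW root = toList <$> execWriterT (go root)
  where
    go :: FilePath -> WriterT (Seq FilePath) IO ()
    go path = do
      entries <- lift (getDirectoryContents path)
      forM_ (filter (`notElem` [".", ".."]) entries) $ \d -> do
        let p = path </> d
        isDir <- lift (doesDirectoryExist p)
        if isDir
          then go p                    -- recurse into subdirectories
          else tell (Seq.singleton p)  -- append one file path to the Seq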
Edit: I have edited the function to use readDirStream, and it helps keep memory usage down. There's still some allocation happening, but the productivity rate is now above 95% and it runs in under a second.
This is the current version:
import System.Directory (doesDirectoryExist)
import System.FilePath ((</>))
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)

list :: FilePath -> IO ()
list path = do
    de <- openDirStream path
    readDirStream de >>= go de
    closeDirStream de
  where
    -- readDirStream returns "" once the stream is exhausted
    go _ ""   = return ()
    go d "."  = readDirStream d >>= go d
    go d ".." = readDirStream d >>= go d
    go d x    = do
      let newpath = path </> x
      e <- doesDirectoryExist newpath
      if e
        then list newpath   >> readDirStream d >>= go d
        else putStrLn newpath >> readDirStream d >>= go d
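For timing I wrap it in a small driver roughly like this (the main wrapper and the usage message are just for illustration, not part of the function itself):

import System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  case args of
    [dir] -> list dir
    _     -> putStrLn "usage: list <directory>"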