views:

237

answers:

3

Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:

module Main where

import Prelude hiding (filter) 
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = execute . mapL putStrLn . listFiles =<< head <$> getArgs

listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <$> (path </>)
     <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))

This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath] version.

What am I doing wrong? I've never used the List package's ListT outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.

+1  A: 

Running it on a large directory reveals a memory leak. I suspect this has to do with the strictness of getDirectoryContents, but there might be more going on. Simple profiling didn't turn up much, I'd add some extra cost centers and go from there...

sclv
+3  A: 

Profiling shows that join (together with doesDirectoryExists) accounts for most of the time in your code. Lets see how its definition unfolds:

  join x
=> (definition of join in Control.Monad)
  x >>= id
=> (definition of >>= in Control.Monad.ListT)
  foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
  foldrL' mappend mempty x

If in the root directory of the search there are k subdirectories and their contents are already computed in the lists: d1, d2, ... dk, then after applying join you'll get (roughly): (...(([] ++ d1) ++ d2) ... ++ dk). Since x ++ y takes time O(length x) the whole thing will take time O(d1 + (d1 + d2) + ... + (d1 + ... dk-1)). If we assume that the number of files is n and they are evenly distributed between d1 ... dk then the time to compute join would be O(n*k) and that is only for the first level of listFiles.

This, I think, is the main performance problem with your solution.

Daniel Velkov
The `concat` in Masse's original implementation (which is much faster) seems to be doing exactly the same thing, but I'll look into this in more detail.
Travis Brown
Concat gets turned by fusion rules into the following, but I'm not clear if it is necessarily faster, and I'm not up for benchmarking at the moment: ` "concat" forall xs. concat xs = build (\c n -> foldr (\x y -> foldr c y x) n xs)`
sclv
To both comments. ListT defines its own list type, so there is some added indirection in all operations on it.
Daniel Velkov
@Daniel: Isn't it just newtyped?
Porges
It's defined like this: data ListItem l a = Nil | Cons { headL :: a, tailL :: l a }
Daniel Velkov
+2  A: 

I'm curious, how well does the same program written to use logict work for you? LogicT is semantically the same as ListT, but implemented in continuation-passing style so that it shouldn't have the concat-related type of problems you seem to be running into.

import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <$> getArgs

cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs

fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero

filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
  x <- xs
  guard $ f x
  return x

listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <$> (path </>)
     <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
Reid Barton