I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep. All is fine and dandy except if the current directory has a very large number of files, in which case I get an error like:

<program>: <currentFile>: openBinaryFile: resource exhausted (Too many open files)

It seems Haskell's laziness is causing the files not to be closed, even after the corresponding line of output has been printed.

The relevant code is below. The function of interest is getList.

import Control.Monad (liftM)
import Data.Word (Word8)
import Text.Printf (printf)
import Data.Digest.MD5 (hash)  -- from the Crypto package
import qualified Data.ByteString.Lazy as BS

main :: IO ()
main = putStr . unlines =<< getList "."

getList :: FilePath -> IO [String]
getList p =
    let getFileLine path = liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
    in mapM getFileLine =<< getRecursiveContents p

hex :: [Word8] -> String
hex = concatMap (\x -> printf "%0.2x" (toInteger x))

getRecursiveContents :: FilePath -> IO [FilePath]
-- ^ Just gets the paths to all the files in the given directory (recursively).

Any ideas on how I could solve this problem?

The entire program is available here: http://haskell.pastebin.com/PAZm0Dcb

Edit: I have plenty of files that don't fit into RAM, so I am not looking for a solution that reads the entire file into memory at once.

+2  A: 

Edit: my assumption was that the user was opening thousands of very small files; it turns out they are very large, so laziness will be essential.

Well, you'll need to use a different IO mechanism. Either:

  • Strict IO (process the files with Data.ByteString or System.IO.Strict),
  • or Iteratee IO (for experts only at the moment).

I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.
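
For instance, a digest that consumes a lazy ByteString directly needs no unpacking at all. A hedged sketch: `md5` here comes from the pureMD5 package mentioned later in this thread, and `hashNoUnpack` is a hypothetical helper name:

import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString.Lazy as BS

hashNoUnpack :: BS.ByteString -> String
hashNoUnpack = show . md5  -- hash the lazy ByteString directly; no [Word8] round trip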

For example, you can replace your lazy IO with System.IO.Strict, yielding:

import Control.Monad (liftM)
import qualified System.IO.Strict as S

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where
        getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                                 (S.readFile path)
Don Stewart
The given code did not work because S.readFile gives a [Char] but hash needs a [Word8]. Unpacking a strict ByteString, however, did work. The downside is that each file is read into memory in its entirety instead of being read lazily by hash, so the program crashes when run on a directory containing a 10.9GB Blu-ray image with the error "<program>: out of memory (requested 11732516864 bytes)". How do I fix this? See: http://haskell.pastebin.com/srbB8bFF
Jesse
Don't unpack bytestrings! This is really important info you're providing: if they're 10G you will need to use lazy IO, and just make sure you hClose the handle when you're done.
Don Stewart
I used `unpack` so the data is in `[Word8]` format for `hash`, but as you said that causes memory usage to blow out. Using TomMD's pureMD5 library is a much better solution. Thanks.
Jesse
A: 

EDIT: sorry, thought the problem was with the files, not directory reading/traversal. Ignore this.

No problem, just explicitly open the file (openFile), read the contents (Data.ByteString.Lazy.hGetContents), perform the md5 hash (let !h = md5 contents), and explicitly close the file (hClose).
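
A minimal sketch of that recipe (hedged: `md5` is assumed to come from a streaming implementation such as pureMD5, mentioned below, and `hashFile` is a hypothetical name; the bang pattern forces the digest while the handle is still open):

{-# LANGUAGE BangPatterns #-}
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Data.Digest.Pure.MD5 (md5)

hashFile :: FilePath -> IO String
hashFile path = do
  hnd <- openFile path ReadMode
  contents <- BS.hGetContents hnd
  let !digest = md5 contents  -- the bang forces the hash, consuming the file in chunks
  hClose hnd
  return (show digest)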

TomMD
Implementing this causes the program to produce no output until it has fully completed, and my system becomes unresponsive and slow long before that happens. I'm not sure what I did wrong. See http://haskell.pastebin.com/6aaqzDwQ (function getFileHash :: FilePath -> IO [Word8])
Jesse
You appear to be using the crypto library - don't do that. Use OpenSSL or my pureMD5 library until I fix Crypto. EDIT: Also, if you want some sort of user feedback while hashing then use some sort of concurrency (forkIO).
TomMD
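
One hedged way to get that kind of feedback (`withProgress` is a hypothetical helper, not TomMD's code): fork a thread that prints a dot to stderr every second while the hashing action runs.

import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Exception (bracket)
import Control.Monad (forever)
import System.IO (hPutStr, stderr)

-- Run an action while a background thread prints a dot to stderr each second.
withProgress :: IO a -> IO a
withProgress action =
  bracket (forkIO . forever $ threadDelay 1000000 >> hPutStr stderr ".")
          killThread        -- stop the reporter even if the action throws
          (const action)

A per-file call might then look like `withProgress (hashFile path)`, with `hashFile` as sketched earlier.
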
1. Your pureMD5 library solves the problem of the system becoming unresponsive. Thanks. 2. You're right: the way the program is structured with `putStr . unlines =<< getList "."`, it will of course calculate all the hashes before doing any output, so that's my fault. Simply doing the IO operations in the right order solves this.
Jesse
+6  A: 

NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.


Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory.

For some reason it still seems to run out of memory on very large files—I don't really understand Iteratee well enough yet to be able to tell why easily.

module Main where

import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8) 
import System.Directory 
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS

import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW

evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext

iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
  where
    chunk s@(I.EOF Nothing) =
      get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
    chunk (I.Chunk c) = do
      modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
      return $ I.Cont (I.IterateeG chunk) Nothing

fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path

main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path) 
   =<< getRecursiveContents "."

getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
  names <- getDirectoryContents topdir

  let properNames = filter (`notElem` [".", ".."]) names

  paths <- concatForM properNames $ \name -> do
    let path = topdir </> name

    isDirectory <- doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else do
        isFile <- doesFileExist path
        if isFile
          then return [path]
          else return []

  return (sort paths)

concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)

Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here—I'm a beginner with this stuff.)

Travis Brown
This answer would be awesome if it worked in constant space, as the Iteratee library seems like a very flexible and safe way of doing this kind of stream processing.
Jesse
My theory is that this runs in linear space because of the lazy `StateT`. While the IO is strict, thunks are made for each call to `md5Update`, so the entire file ends up in memory in these thunks and the `md5Update`s are not evaluated until the hash is printed. Yairchu's answer runs in constant space because `foldlL` from the List package just so happens to be strict. Using `seq`, `$!` or the strict `StateT` should force the digest to be evaluated for each chunk resulting in constant space.
Jesse
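
For illustration, a hedged sketch of what Jesse suggests, applied to the chunk function in this answer (untested against this old iteratee API; `$!` forces each updated context to weak head normal form, which should stop the thunk chain from building up):

    chunk (I.Chunk c) = do
      ctx <- get
      put $! md5Update ctx (BS.fromChunks [IW.unWrap c])  -- force the update now
      return $ I.Cont (I.IterateeG chunk) Nothing
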
+1  A: 

The problem is that mapM is not as lazy as you think - it results in a full list with one element per file path. And the file IO you are using is lazy, so you get a list with one open file per file path.

The simplest solution in this case is to force the evaluation of the hash for each file path. One way to do that is with Control.Exception.evaluate:

getFileLine path = do
  theHash <- liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
  evaluate theHash

As others have pointed out, we're working on a replacement for the current approach to lazy IO that is more general yet still simple.

Yitz
Evaluating the hash also forces the file to be closed, so this answer is also correct.
Jesse
A: 

unsafeInterleaveIO?

Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread in Haskell Cafe.

It defers an input-output operation (opening a file) until it is actually required. Thus it is possible to avoid opening all files at once, and instead read and process them sequentially (open them lazily).

Now, I believe, mapM getFileLine opens all files but does not start reading from them until putStr . unlines. Thus a lot of thunks with open file handles float around; this is the problem. (Please correct me if I am wrong.)

An example

A modified example with unsafeInterleaveIO has been running against a 100 GB directory for several minutes now, in constant space.

import Control.Monad (liftM)
import System.IO.Unsafe (unsafeInterleaveIO)
import Data.Digest.Pure.MD5 (md5)  -- pureMD5
import qualified Data.ByteString.Lazy as BS

getList :: FilePath -> IO [String]
getList p =
  let getFileLine path =
        liftM (\c -> (show . md5 $ c) ++ " " ++ path)
        (unsafeInterleaveIO $ BS.readFile path)
  in mapM getFileLine =<< getRecursiveContents p

(I switched to the pureMD5 implementation of the hash.)

P.S. I am not sure if this is good style. I believe that solutions with iteratees and strict IO are better, but this one was quicker to write. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.

jetxee
+7  A: 

Lazy IO is very bug-prone.

As dons suggested, you should use strict IO.

You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.

import Control.Monad (when)
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)

hashFile :: FilePath -> IO BS.ByteString
hashFile =
    fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024

strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
    takeWhile (not . BS.null) $ do
        handle <- liftIO $ openFile filename ReadMode
        repeat () -- this makes the lines below loop
        chunk <- liftIO $ BS.hGet handle chunkSize
        when (BS.null chunk) . liftIO $ hClose handle
        return chunk

I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.

Monadic lists/ListT come from the "List" package on hackage (transformers' and mtl's ListT are broken and also don't come with useful functions like takeWhile)
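
A hypothetical usage sketch, assuming getRecursiveContents and the hex helper from the question (hashFile is the function defined above):

main :: IO ()
main = mapM_ report =<< getRecursiveContents "."
  where
    report path = do
      digest <- hashFile path
      -- BS.unpack yields the [Word8] that the question's hex helper expects
      putStrLn (hex (BS.unpack digest) ++ " " ++ path)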

yairchu
This answer is my favorite. Instead of working around the problems with lazy IO by forcing the hashes to be evaluated sequentially (with `!`, `Control.Exception.evaluate`, or by printing the corresponding output line), it replaces lazy IO with safe, 'structured' strict IO, like Travis Brown tried with the Iteratee library. One downside, however, is that it seems slow compared to the lazy I/O solutions such as Duncan Coutts's, but I'm not familiar enough with the ListT monad to figure out why.
Jesse
+16  A: 

You don't need to use any special way of doing IO; you just need to change the order in which you do things. So instead of opening all files and then processing the content, you open one file and print one line of output at a time.

import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString.Lazy as BS

main :: IO ()
main = mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path) 
   =<< getRecursiveContents "."

fileLine :: FilePath -> BS.ByteString -> String
fileLine path c = hash c ++ " " ++ path

hash :: BS.ByteString -> String 
hash = show . md5

BTW, I happen to be using a different md5 hash lib; the difference is not significant.

The main thing that is going on here is the line:

mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)

It opens a single file, consumes the whole content of that file, and prints one line of output. The file gets closed precisely because the whole content is consumed. Previously you were delaying the point at which the file was consumed, which delayed the point at which it was closed.

If you are not quite sure whether you are consuming all the input but want to make sure the file gets closed anyway, then you can use the withFile function from System.IO:

mapM_ (\path -> withFile path ReadMode $ \hnd -> do
                  c <- BS.hGetContents hnd
                  putStrLn (fileLine path c))

The withFile function opens the file and passes the file handle to the body function. It guarantees that the file gets closed when the body returns. This "withBlah" pattern is very common when dealing with expensive resources. This resource pattern is directly supported by Control.Exception.bracket.
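
For reference, a hedged sketch of the same thing written directly with bracket (hashOne is a hypothetical name; fileLine and the BS alias are as defined in this answer):

import Control.Exception (bracket)
import System.IO (IOMode(ReadMode), openFile, hClose)

hashOne :: FilePath -> IO ()
hashOne path =
  bracket (openFile path ReadMode) hClose $ \hnd -> do
    c <- BS.hGetContents hnd
    -- printing the line consumes the whole content before bracket closes the handle
    putStrLn (fileLine path c)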

Duncan Coutts
Actually using the other md5 hash lib is somewhat significant. It means we process each file in constant space. The original program was unpacking the ByteString to a String before hashing. Not only is that slow but for reasons I do not recall, the ByteString unpack operation is fully strict which forces the whole file into memory.
Duncan Coutts
+1 Excellent answer, and from one of the authors of the Data.ByteString library.
rtperson
+1 for much better than my silly answer.
Travis Brown
Well, that assumes that all we ever want to do is print all of them. Then printing them one by one does force the hash calculation, yes. But by just forcing the calculation on each file as I suggested, getList still remains useful, and it's not any more complicated.
Yitz
@Travis Brown - Not silly at all, your answer is just at a different stage in the learning curve. I have the same problem myself: I tend to over-engineer solutions, and then kick myself when I see an expert solution that takes my 12-line solution and compresses it into a one-liner. The experts in Haskell seem to be able to combine these operations in their heads much better than I can -- something that comes, I assume, through both experience and training.
rtperson
This answer is great because it is short. Like the other answers that stick to lazy I/O (as opposed to iteratee-style strict I/O), it forces each file to be closed by printing the corresponding line of output before the next file is handled. I do, however, consider this a 'workaround' for the problems of lazy I/O, so I've accepted the answer using iteratee-style strict I/O instead.
Jesse