Greetings,

I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.

EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process is using about 810MB by the time it hits the 'getLine'.

Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that was read from the file can't be garbage-collected because it's still referenced? But if so, I thought BS.copy would work around that. Any ideas how to force the computation? I can't seem to get 'seq' into the right place to have an effect.

(NB the source file is tab-separated lines)

Thanks in advance,

Kevin

module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad


type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
    liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where 
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn $ "Done" ++ (show $ last jrs)
    hFlush stdout
    x <- getLine
    return ()

    -- (***)
    loopInput jrs

main = do 
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    loopInput jrs
+2  A: 

Your call to last forces the list jrs. To find the last element it must run through the entire file, building up a thunk for each entry in jrs. Because you never evaluate any element of jrs (except the last one), these thunks hang around holding references to the ByteString, so it must stay in memory.

The solution is to force the evaluation of those thunks. Because we're talking about space, the first thing I did was store your info in a smaller format:

type Year   = Word16
type Month  = Word8
type Day    = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day 
        deriving (Eq, Ord, Show, Read)

This reduces that ugly 10-byte ByteString (plus roughly 16 bytes of structural overhead) to around 8 bytes.

importRecord now has to call toRecord r to get the right type:

toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
            (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
            _ -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
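
As a quick sanity check on the date format from the question, the parse behaves roughly like this (a hypothetical GHCi session; exact output formatting may vary):

*Main> toRecord (BS.pack "2009-10-01")
Rec 2009 10 1
*Main> toRecord (BS.pack "oops")
Rec 0 0 0

Note that rup is read-based, so a malformed field that still contains two dashes will throw rather than fall back to Rec 0 0 0.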

We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance (from DeepSeq):

instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()

Now we're ready to go. I modified main to use evalList, thus forcing the whole list before the function that wants the last element:

main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'
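
For the fragments above to compile you'll also want the matching imports; roughly the following, though the exact modules depend on your parallel/deepseq versions (newer parallel re-exports NFData):

import Data.Word (Word16, Word8)
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (using, evalList, rdeepseq)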

And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).

[heap profile image]

Sorry about that other misleading, wrong answer - I was hooked on the fact that incremental processing fixes it, and didn't really register that the thunks were hanging around; not sure why my brain glided over that. I do stand by the gist, though: you should process this information incrementally, which would make all of this answer moot.
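
For what it's worth, "incrementally process" here means consuming each line as it's read and keeping only an aggregate, instead of retaining the whole list. A rough sketch, assuming the packed Record and the modified importRecord from above (summarise is just a made-up name for illustration):

import Data.List (foldl')

-- Walk the lines once, keeping only a running count and the most recently
-- parsed record; nothing beyond the current chunk has to stay live.
summarise :: BS.ByteString -> (Int, Record)
summarise = foldl' step (0, Rec 0 0 0) . BS.lines
  where
    step (n, _) line =
        let n' = n + 1
            r  = importRecord line
        in n' `seq` r `seq` (n', r)

main would then just readFile, print the summary, and never materialise jrs at all.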

FYI the huge ByteString didn't show up in those previous heap profiles I posted because foreign allocations (which include ByteString data) aren't tracked by the heap profiler.

TomMD
Thanks for the comprehensive response; it's insightful to see the use of seq/deepSeq to force the evaluation, but I'm still not seeing why this program misbehaves in the first place. It seems to hinge on whether the line below (***) is there; if it is, then the large ByteString hangs around in memory. If I edit the line "show (last jrs)" to read "show jrs" then that should also force the *entire* list to be evaluated, correct? In that case the program again runs in 1-2M without the (***) line, but holds the file in memory if "loopInput jrs" is called again. Sorry if I'm being a bit thick!
Kevin
Using the above `Record` definition (~24 bytes per entry, 4.1M entries) and `show jrs` instead of `show $ last jrs`, I see 300MB taken up (~100MB for the Record list, ~50MB unaccounted for, times 2 for the copying GC). I think the foreign allocator only hands out blocks of 64+ bytes, so for each of your 1M entries you have 1 word for the LPS constructor, 1 word for the PS constructor, 1 word for the pointer, 2 words for the length and offset, 1 word for the next LPS field, plus 64+ bytes for the actual data; list cells take another 3 words --> 9*8 + 64 = 136 bytes, so 136MB for 1M entries, times 2 for GC, plus any unknowns.
TomMD
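
Spelling out that per-entry estimate (assuming 8-byte words and the 64-byte minimum foreign allocation mentioned above):

bytesPerEntry :: Int
bytesPerEntry = (1 + 1 + 1 + 2 + 1 + 3) * 8 + 64
-- LPS + PS + pointer + length/offset + next-LPS field + list cell
-- = 9 words = 72 bytes, plus the 64-byte data block = 136 bytes

estimatedBytes :: Int
estimatedBytes = bytesPerEntry * 1000000 * 2
-- about 272MB for a million entries, doubled for the copying GC
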
Kevin: The math in my previous comment doesn't add up to your 800MB use (I know, I see that), but it's progress. Perhaps you could run some profiling, or replace your BS.ByteString with a packed record (like mine), and see how things go?
TomMD
Kevin: And my 300M result is further reduced to 44M if I use `Data.Vector.Unboxed`. Go Roman!
TomMD
I'll look into the Vector.Unboxed thing; it looks like it will greatly help the situation. I've also tried the packed-record approach and the strictness techniques from your example, but they didn't seem to make a difference, oddly enough - perhaps it's a difference in the way the Windows version of GHC works. Thanks for all your help!
Kevin
Kevin: That is odd. If you're still interested, then try replacing `let jrs' = using jrs (evalList rdeepseq)` with `let !x = rnf jrs` and use `jrs` (drop any mention of `jrs'`). This should be pretty much the same for your particular case.
TomMD
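
Spelled out, that variant looks roughly like this (needs the BangPatterns extension; filename as in the question):

{-# LANGUAGE BangPatterns #-}

main :: IO ()
main = do
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    let !x = rnf jrs   -- rnf :: NFData a => a -> (); the bang forces the
                       -- whole list (and every Record in it) right here
    loopInput jrs
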
+1  A: 

There seem to be two questions here:

  • why does the memory usage depend on the presence or absence of the line (***);
  • why is the memory usage with (***) present about 800MB, rather than, say, 40MB.

I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)

As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like

-- Like map, but evaluates each element to WHNF as the list is consumed
map' :: (a -> b) -> [a] -> [b]
map' f []     = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)

Replace the map in importRecords with map' and see whether that reduces your memory usage.
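
Concretely, that substitution in the question's importRecords would be:

importRecords :: String -> IO [Record]
importRecords filename =
    liftM (map' importRecord . BS.lines) (BS.readFile filename)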

Reid Barton
I did try the "show jrs" modification, but sadly it didn't reduce the memory usage. Thanks for your help though; the answers posted here have been insightful to me.
Kevin