Greetings,

I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.

EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process is using about 810MB by the time it hits the 'getLine'.

Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that was read from the file can't be garbage-collected because it's still referenced? But if so, I thought BS.copy would work around that. Any ideas how to force the computation? I can't seem to get 'seq' into the right place to have an effect.

(NB the source file is tab-separated lines)

Thanks in advance,

Kevin

module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad


type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
    liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where 
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn $ "Done" ++ (show $ last jrs)
    hFlush stdout
    x <- getLine
    return ()

    -- (***)
    loopInput jrs

main = do 
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    loopInput jrs
+2  A: 

Your call to last forces the list jrs. To find the last element it must run through the entire file, building up a thunk for each entry in jrs. Because you never evaluate any element of jrs (except the last one), these thunks hang around holding references to the ByteString, so it must stay in memory.

The solution is to force the evaluation of those thunks. Because we're talking about space, the first thing I did was store your info in a smaller format:

type Year   = Word16
type Month  = Word8
type Day    = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day 
        deriving (Eq, Ord, Show, Read)

This reduces that ugly 10-byte ByteString (plus roughly 16 bytes of structural overhead) to around 8 bytes.

importRecord now has to call toRecord r to get the right type:

toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
            (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
            _ -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
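
As a quick sanity check on the date format from the question, the parse behaves roughly like this (a hypothetical GHCi session; exact output formatting may vary):

*Main> toRecord (BS.pack "2009-10-01")
Rec 2009 10 1
*Main> toRecord (BS.pack "oops")
Rec 0 0 0

Note that rup is read-based, so a malformed field that still contains two dashes will throw rather than fall back to Rec 0 0 0.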

We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance (from DeepSeq):

instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()

Now we're ready to go. I modified main to use evalList, thus forcing the whole list before the function that wants the last element:

main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'
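
For the fragments above to compile you'll also want the matching imports; roughly the following, though the exact modules depend on your parallel/deepseq versions (newer parallel re-exports NFData):

import Data.Word (Word16, Word8)
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (using, evalList, rdeepseq)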

And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).

[heap profile image]

Sorry about that other misleading, wrong answer - I was hooked on the fact that incremental processing fixes it, and didn't really register that the thunks were hanging around; not sure why my brain glided over that. I do stand by the gist, though: you should process this information incrementally, which would make all of this answer moot.
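
For what it's worth, "incrementally process" here means consuming each line as it's read and keeping only an aggregate, instead of retaining the whole list. A rough sketch, assuming the packed Record and the modified importRecord from above (summarise is just a made-up name for illustration):

import Data.List (foldl')

-- Walk the lines once, keeping only a running count and the most recently
-- parsed record; nothing beyond the current chunk has to stay live.
summarise :: BS.ByteString -> (Int, Record)
summarise = foldl' step (0, Rec 0 0 0) . BS.lines
  where
    step (n, _) line =
        let n' = n + 1
            r  = importRecord line
        in n' `seq` r `seq` (n', r)

main would then just readFile, print the summary, and never materialise jrs at all.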

FYI the huge ByteString didn't show up in those previous heap profiles I posted because foreign allocations (which include ByteString data) aren't tracked by the heap profiler.

TomMD
Thanks for the comprehensive response; it's insightful to see the use of seq/deepSeq to force the evaluation, but I'm still not seeing why this program misbehaves in the first place. It seems to hinge on whether the line below (***) is there; if it is, then the large ByteString hangs around in memory. If I edit the line "show (last jrs)" to read "show jrs" then that should also force the *entire* list to be evaluated, correct? In that case the program again runs in 1-2M without the (***) line, but holds the file in memory if "loopInput jrs" is called again. Sorry if I'm being a bit thick!
Kevin
Using the above `Record` definition (~24 bytes per entry, 4.1M entries) and `show jrs` instead of `show $ last jrs`, I see 300MB taken up (~100MB for the Record list, ~50MB unaccounted for, times 2 for the copying GC). I think the foreign allocator only hands out blocks of 64+ bytes, so for each of your 1M entries you have 1 word for the LPS constructor, 1 word for the PS constructor, 1 word for the pointer, 2 words for the length and offset, 1 word for the next LPS field, plus 64+ bytes for the actual data; list cells take another 3 words --> 9*8 + 64 = 136 bytes, so 136MB for 1M entries, times 2 for GC, plus any unknowns.
TomMD
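
Spelling out that per-entry estimate (assuming 8-byte words and the 64-byte minimum foreign allocation mentioned above):

bytesPerEntry :: Int
bytesPerEntry = (1 + 1 + 1 + 2 + 1 + 3) * 8 + 64
-- LPS + PS + pointer + length/offset + next-LPS field + list cell
-- = 9 words = 72 bytes, plus the 64-byte data block = 136 bytes

estimatedBytes :: Int
estimatedBytes = bytesPerEntry * 1000000 * 2
-- about 272MB for a million entries, doubled for the copying GC
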
Kevin: The math in my previous comment doesn't add up to your 800MB use (I know, I see that), but it's progress. Perhaps you could run some profiling, or replace your BS.ByteString with a packed record (like mine), and see how things go?
TomMD
Kevin: And my 300M result is further reduced to 44M if I use `Data.Vector.Unboxed`. Go Roman!
TomMD
I'll look into the Vector.Unboxed thing; it looks like it will greatly help the situation. I've also tried the packed-record approach and the strictness techniques from your example, but they didn't seem to make a difference, oddly enough - perhaps it's a difference in the way the Windows version of GHC works. Thanks for all your help!
Kevin
Kevin: That is odd. If you're still interested, then try replacing `let jrs' = using jrs (evalList rdeepseq)` with `let !x = rnf jrs` and use `jrs` (drop any mention of `jrs'`). This should be pretty much the same for your particular case.
TomMD
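
Spelled out, that variant looks roughly like this (needs the BangPatterns extension; filename as in the question):

{-# LANGUAGE BangPatterns #-}

main :: IO ()
main = do
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    let !x = rnf jrs   -- rnf :: NFData a => a -> (); the bang forces the
                       -- whole list (and every Record in it) right here
    loopInput jrs
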
+1  A: 

There seem to be two questions here:

  • why does the memory usage depend on the presence or absence of the line (***);
  • why is the memory usage with (***) present about 800MB, rather than, say, 40MB.

I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)

As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like

-- Like map, but evaluates each element to WHNF as the list is consumed
map' :: (a -> b) -> [a] -> [b]
map' f []     = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)

Replace the map in importRecords with map' and see whether that reduces your memory usage.
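
Concretely, that substitution in the question's importRecords would be:

importRecords :: String -> IO [Record]
importRecords filename =
    liftM (map' importRecord . BS.lines) (BS.readFile filename)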

Reid Barton
I did try the "show jrs" modification, but sadly it didn't reduce the memory usage. Thanks for your help though; the answers posted here have been insightful to me.
Kevin