views: 3619
answers: 2

I have a Haskell program which processes a text file and builds a Map (with several million elements). The whole thing can run for 2-3 minutes. I found that tweaking the -H and -A options makes a big difference in running time.

There is documentation about this functionality of the RTS, but it's a hard read for me since I don't know the algorithms and terms from GC theory. I'm looking for a less technical explanation, preferably specific to Haskell/GHC. Are there any references about choosing sensible values for these options?

EDIT: Here's the code; it builds a trie for a given list of words.

buildTrie :: [B.ByteString] -> MyDFA
buildTrie l = fst3 $ foldl' step (emptyDFA, B.empty, 1) $ sort $ map B.reverse l where
    -- fold over the reversed words in sorted order, threading the DFA,
    -- the previously inserted word and the next free state label
    step :: (MyDFA, B.ByteString, Int) -> B.ByteString -> (MyDFA, B.ByteString, Int)
    step (dfa, lastWord, newIndex) newWord = (insertNewStates, newWord, newIndex + B.length newSuffix) where
        (pref, lastSuffix, newSuffix) = splitPrefix lastWord newWord
        branchPoint = transStar dfa pref

        --new state labels for the newSuffix path
        newStates = [newIndex .. newIndex + B.length newSuffix - 1]
        --insert newStates
        insertNewStates = foldl' (flip insertTransition) dfa $ zip3 (branchPoint:init newStates) (B.unpack newSuffix) newStates
+35  A: 

Generally speaking, garbage collection is a space/time tradeoff. Give the GC more space, and it will take less time. There are (many) other factors in play, cache in particular, but the space/time tradeoff is the most important one.

The tradeoff works like this: the program allocates memory until it reaches some limit (decided by the GC's automatic tuning parameters, or explicitly via RTS options). When the limit is reached, the GC traces all the data that the program is currently using, and reclaims all the memory used by data that is no longer required. The longer you can delay this process, the more data will have become unreachable ("dead") in the meantime, so the GC avoids tracing that data. The only way to delay a GC is to make more memory available to allocate into; hence more memory equals fewer GCs, equals lower GC overhead. Roughly speaking, GHC's -H option lets you set a lower bound on the amount of memory used by the GC, so can lower the GC overhead.

GHC uses a generational GC, which is an optimisation to the basic scheme, in which the heap is divided into two or more generations. Objects are allocated into the "young" generation, and the ones that live long enough get promoted into the "old" generation (in a 2-generation setup). The young generation is collected much more frequently than the old generation, the idea being that "most objects die young", so young-generation collections are cheap (they don't trace much data), but they reclaim a lot of memory. Roughly speaking, the -A option sets the size of the young generation - that is, the amount of memory that will be allocated before the young generation is collected.

The default for -A is 512k: it's a good idea to keep the young generation smaller than the L2 cache, and performance generally drops if you exceed the L2 cache size. But working in the opposite direction is the GC space/time tradeoff: using a very large young generation size might outweigh the cache benefits by reducing the amount of work the GC has to do. This doesn't always happen, it depends on the dynamics of the application, which makes it hard for the GC to automatically tune itself. The -H option also increases the young generation size, so can also have an adverse effect on cache usage.

The bottom line is: play around with the options and see what works. If you have plenty of memory to spare, you might well be able to get a performance increase by using either -A or -H, but not necessarily.
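For example, assuming your program was compiled to an executable called trie (a made-up name here) and takes the input file as an argument, you could start with something like:

# hypothetical program name and sizes, just starting points to experiment with
./trie words.txt +RTS -s -A8m
./trie words.txt +RTS -s -A64m -H256m

The +RTS -s summary reports total allocation, time spent in GC and maximum residency, which is what you want to watch while varying -A and -H.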

Simon Marlow
Thanks for the answer. So, if I understood correctly, if the program creates big data structures and uses a lot of memory, it's better to give it a hint in advance and set large -H and -A values. That's what my limited experience has shown. Is that right? Also, there are some more options which look even more magical: -c, -F, -G. Can they improve the running time like -H? (I found that they can make it worse.)
Daniel Velkov
It's not that simple, unfortunately. If your program needs a lot of memory then it is more likely that using a larger -A or -H will help, but not always - the best thing to do is try it and see (use +RTS -s to measure). The most common performance issue that people see is when the program is creating a large amount of long-lived data in a short space of time (as your program does). In this situation, the generational GC assumption that "most objects die young" is invalid, and generational GC is hurting rather than helping. This is where using a large -A value often helps.
Simon Marlow
Btw, wouldn't it be useful to have a tool which searches the space of possible options and gives you the optimal ones for your program? I guess the standard techniques should work, because the function is smooth enough.
Daniel Velkov
@djv I wrote such a tool today: http://hackage.haskell.org/package/ghc-gc-tune
Don Stewart
Wow, that was fast!
Daniel Velkov
+4  A: 

It is probably possible to reproduce the issue for smaller data sets, where it will be easier to see what is going on. In particular, I suggest gaining some familiarity with profiling.

Then, check whether the memory profiles you see match your expectations (you don't need to know about all the profiling options to get useful graphs). The combination of strict foldl' with a non-strict tuple as the accumulator would be the first thing I'd look at: if the components of the tuple are not forced, that recursion is building up unevaluated expressions.

Btw, you can build up a good intuition about such things by trying to evaluate your code by hand for really tiny data sets. A few iterations will suffice to see whether expressions get evaluated or remain unevaluated in the way that fits your application.
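For illustration only - this uses a plain pair of Ints rather than your (MyDFA, ByteString, Int) triple, so none of the names below come from your code - the difference between a non-strict and a strict accumulator looks roughly like this:

{-# LANGUAGE BangPatterns #-}
-- Illustrative sketch only, not related to buildTrie.
import Data.List (foldl')

-- Non-strict accumulator: foldl' forces the pair constructor at each step,
-- but the Ints inside it remain growing chains of unevaluated (+) thunks.
sumAndCountLazy :: [Int] -> (Int, Int)
sumAndCountLazy = foldl' step (0, 0)
  where step (s, n) x = (s + x, n + 1)

-- Strict accumulator: the bang patterns force both components on every step,
-- so the pair always holds plain evaluated Ints.
sumAndCountStrict :: [Int] -> (Int, Int)
sumAndCountStrict = foldl' step (0, 0)
  where step (!s, !n) x = (s + x, n + 1)

main :: IO ()
main = print (sumAndCountStrict [1 .. 1000000])

In the first version the components of the pair only get evaluated when the final result is demanded; in the second, they are forced one step at a time, so no long chains of thunks build up.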

claus
I added strictness to the tuple elements (using bang patterns, is that the right way?) and couldn't see any difference in running time. You propose running with a smaller data set, but then I can't see exactly the effects I'm asking about, because the GC only starts to matter with bigger data/more memory allocated.
Daniel Velkov
Yes, bang patterns are one way and should suffice for `ByteString` and `Int`; for `MyDFA`, it depends on your definition. A difference in space usage would probably show up in running time as well, which is why I suggested checking. Profiles provide you with more information about where that runtime and total heap usage are actually spent, helping you to ensure that you are only using the resources you need. Those GC options are helpful once you are sure that the heap profiles you are seeing are unavoidable. I just wanted to make sure that you tune your code before you tune the way it is run.
claus
Can I use the heap profile output (from the -hX options) to find out whether I am building up unnecessary unevaluated expressions? I am not enough of a Haskell guru to draw conclusions from paper-and-pencil evaluation. Maybe I should open another question asking how to profile a Haskell program's memory behaviour.
Daniel Velkov
Don't forget you can get a quick heap profile by just running your program with +RTS -h - no need to recompile for profiling. This tells you the shape of the heap profile and what is taking up the space, but it doesn't tell you which part of the program created the data; for that you need to recompile for profiling.
Simon Marlow
@Simon That's strange: compiling with -prof and without it gives different graphs when running with +RTS -h. Why is that?
Daniel Velkov