I recently read the excellent article "The Transactional Memory / Garbage Collection Analogy" by Dan Grossman. One sentence really caught my attention:

In theory, garbage collection can improve performance by increasing spatial locality (due to object-relocation), but in practice we pay a moderate performance cost for software engineering benefits.

Until then, my feeling about this had always been rather vague. Over and over, you see claims that GC can be more efficient, so I always kept that notion in the back of my head. After reading this, however, I started having serious doubts.

As an experiment to measure the impact of GC, some people took some Java programs, traced the execution, and then replaced garbage collection with explicit memory management. According to this review of the article on Lambda the Ultimate, they found that GC was always slower. Virtual memory issues made GC look even worse, since the collector regularly touches far more memory pages than the program itself at that point, and therefore causes a lot of swapping.

These are all isolated experiments, though. Has anybody, particularly in the context of C++, performed a comprehensive benchmark of GC performance compared to explicit memory management?

It would be particularly interesting to compare how various big open-source projects, for example, perform with and without GC. Has anybody heard of such results before?

EDIT: And please focus on the performance problem, not on why GC exists or why it is beneficial.

Cheers,

Carl

PS. In case you're already pulling out the flame-thrower: I am not trying to disqualify GC, I'm just trying to get a definitive answer to the performance question.

A: 

It is the fact that developers are human and miss things that created the need for garbage collectors in the first place. That said, garbage collection will always be slower than perfect explicit memory management. And garbage collection can often be faster than imperfect explicit memory management, given that garbage collectors clean up the things developers tend to forget.

Andrew Hare
Sorry, but this is not true. A program using explicit memory management is either correct, and then it frees everything (nothing is forgotten), or incorrect. The (im)perfectness only means you hold objects longer than necessary, and in this respect, GC chooses the most imperfect (or pessimistic) way.
jpalecek
... You can, however, solve this by explicitly setting references to null.
jpalecek
@jpalecek: "GC chooses the most imperfect (or pessimistic) way". Obviously not true. The most imperfect is defer collection indefinitely, i.e. to leak. GCs collect after values become unreachable. Manual memory management does not collect at the earliest possible place. In particular, common solutions like reference-counted smart pointers in C++ defer collection to the end of scope which can be a long time after a value became unreachable and would have been collected by a tracing GC.
Jon Harrop
+15  A: 

The cost of memory allocation is generally much lower in a garbage-collected memory model than when just using new or malloc explicitly, because garbage collectors generally pre-allocate this memory. However, explicit memory models may also do this (using memory pools or arenas), making the cost of memory allocation equivalent to a pointer addition.
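
To make the pool idea concrete, here is a minimal bump-pointer arena in C++ (a sketch; the class name and constants are invented, and a real pool would also need growth and thread safety):

    #include <cstddef>   // std::size_t
    #include <cstdlib>   // std::malloc, std::free
    #include <new>       // std::bad_alloc

    // A minimal bump-pointer arena: the allocation fast path is a single
    // pointer addition, and everything is released at once in the destructor.
    class Arena {
    public:
        explicit Arena(std::size_t capacity)
            : begin_(static_cast<char*>(std::malloc(capacity))),
              next_(begin_),
              end_(begin_ ? begin_ + capacity : 0) {
            if (!begin_) throw std::bad_alloc();
        }
        ~Arena() { std::free(begin_); }

        void* allocate(std::size_t n) {
            n = (n + kAlign - 1) & ~(kAlign - 1);   // keep allocations aligned
            if (static_cast<std::size_t>(end_ - next_) < n)
                throw std::bad_alloc();
            char* p = next_;
            next_ += n;                             // the whole fast path
            return p;
        }

    private:
        static const std::size_t kAlign = 16;
        char* begin_;
        char* next_;
        char* end_;
    };

Objects are then constructed in place with placement new, e.g. new (arena.allocate(sizeof(Foo))) Foo(), and nothing is freed individually; the whole arena goes away at once, which is roughly what the allocation fast path of a compacting collector looks like too.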

As Raymond Chen and Rico Mariani pointed out, managed languages tend to outperform unmanaged languages in the general case. However, with enough optimization effort, the unmanaged language can and eventually will beat the GC/JITted language.

The same thing is also evident in the Computer Language Shootout: even though C++ tends to rank higher than Java most of the time, you'll often see C++ implementations jumping through various hoops (such as object pools) to achieve optimal performance. Garbage-collected languages, however, tend to have easier-to-follow and more straightforward implementations, because the GC is better at allocating (small chunks of) memory.

However, performance isn't the biggest difference when it comes to GC vs non-GC; arguably it's the deterministic finalization (or RAII) of non-GC (and reference-counted) languages that is the biggest argument for explicit memory management, because it is generally used for purposes other than memory management (such as releasing locks, closing file or window handles, et cetera). 'Recently', however, C# introduced the using / IDisposable construct to do exactly this.
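
For comparison, here is a minimal sketch of the RAII idiom in C++ (the File wrapper is invented for the example; C#'s using block plays the same role):

    #include <cstdio>
    #include <stdexcept>

    // RAII: the resource is released in the destructor at a deterministic
    // point (scope exit), whether we leave normally or via an exception.
    class File {
    public:
        File(const char* path, const char* mode) : f_(std::fopen(path, mode)) {
            if (!f_) throw std::runtime_error("cannot open file");
        }
        ~File() { std::fclose(f_); }
        std::FILE* get() const { return f_; }
    private:
        std::FILE* f_;
        File(const File&);             // non-copyable (pre-C++11 style)
        File& operator=(const File&);
    };

    void write_log(const char* msg) {
        File log("app.log", "a");      // handle acquired here...
        std::fprintf(log.get(), "%s\n", msg);
    }                                  // ...and closed here, deterministically

The destructor runs at a deterministic point whether the function returns normally or throws, which is exactly the property a finalizer-based GC cannot promise.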

Another problem with garbage collection is that the systems it uses to prevent memory leaks tend to be rather complex. This also makes it much more difficult to debug and track down a memory leak once you do have one (yes, even garbage-collected languages can have memory leaks).

On the flip side, a garbage-collected language can do the optimal thing at the optimal time (or approximately so) without burdening the developer with that task. This means that developing in a GC language can feel more natural, so you can focus more on the real problem.

Jasper Bekkers
The example of Chen and Mariani is certainly **not** the general case. Rather, it is one very special (albeit commonly occurring) problem where .NET happens to perform better for pure technical reasons. There is no a priori reason for managed languages to perform better in similar scenarios.
Konrad Rudolph
+1 @ Jasper, with a caveat - IDisposable has nothing to do with deterministic finalisation. It is for cleaning up external resources and has no direct impact on managed memory collection.
Jim Arnold
@Konrad, I know their problem was more concerned with i18n and string manipulation. But I feel it's more representative of a real-world application than the "performance critical" applications C++ is generally used for these days.
Jasper Bekkers
@Jim, correct. I was hinting at using the memory management semantics for different purposes, such as is often done with RAII. After that, I tried to point out how to do a similar thing in .NET (without the proper tools).
Jasper Bekkers
@Jim, I updated the post to reflect my argument better.
Jasper Bekkers
I'm seeing a lot of best case GC compared to worst case manual memory management (naive malloc). This seems like comparing a sports car to a bicycle and concluding that two-wheeled transport is slow. (http://en.wikipedia.org/wiki/Straw_man)
Waylon Flinn
@Waylon, I've tried my best to compare general cases. If I've missed something, please notify me in the comments of any specifics.
Jasper Bekkers
The article you linked to actually supports the idea that a non-naive allocator can generally be made to outperform GC. This directly contradicts the statement: "managed languages tend to out perform unmanaged languages in the general case". Perhaps you got that backwards?
Waylon Flinn
@Waylon, what I meant with that statement is that managed languages outperform native languages when coding in a straight forward way. That is, when no optimizations are done (yet).
Jasper Bekkers
@Jasper, Ah, thank you for clarifying that. I definitely believe that, strangely, the state of the art for GCs is more advanced. However, this usually seems to produce comparable performance at best, with a few exceptions. What I'd like to see is more work done on manual management to make it easier to use and more performant. I think they could probably learn some things from the GC community toward this end.
Waylon Flinn
@Waylon, there are libraries out there that help with these things to some extent, such as boost/object_pool.
Jasper Bekkers
And modern malloc() implementations tend to outperform custom allocators, including, I guess, object pools: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.5392 (Disclaimer: I just read this summary, not the article: http://www.thegibson.org/blog/archives/17)
Blaisorblade
You all seem to be assuming single threaded programs. Pool allocators break down in multithreaded programs whereas GC does not.
Jon Harrop
A: 

In theory, a well-profiled program can inform an intelligent GC subsystem to attain the described speedups over manual memory management. These speedups may not be visible without long runtimes, needed to amortize the fixed startup cost of GC.

In practice, you will likely NOT realize these speedups with present-day GC implementations. Furthermore, you will NOT get a definitive answer, because there will always be pathologically bad scenarios for both cases.

HUAGHAGUAH
+1 for practicality
Waylon Flinn
A: 

In theory, GC may be faster in some cases, but I have never seen that, and I doubt I ever will. Also, C++ with a GC such as the Boehm GC will probably always be slower, because that GC is conservative. With all the pointer fiddling in C++, the GC has to pretend everything is a pointer. With a language like Java, the GC can know exactly what is and isn't a pointer, so it has the potential to be faster.
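
For what it's worth, trying this yourself is cheap: the Boehm collector drops into C or C++ with a couple of calls (a sketch assuming libgc is installed; compile and link with -lgc):

    #include <gc.h>   // Boehm-Demers-Weiser conservative collector

    int main() {
        GC_INIT();
        for (int i = 0; i < 1000000; ++i) {
            // No matching free: blocks become garbage and are reclaimed by
            // the collector, which must conservatively treat anything that
            // looks like a pointer as a pointer.
            void* p = GC_MALLOC(64);
            (void)p;
        }
        return 0;
    }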

Zifre
@Zifre: "GC may be faster in some cases, but I have never seen that". Logic programs are a good example of programs that often get higher throughput with GC because they are so difficult to write correctly (let alone optimize) without a GC.
Jon Harrop
+3  A: 

Here's an experiment that I like to run:

  1. Start up a program written in a garbage collected environment (like .NET or Java).
  2. Start up a similar program written in a non garbage collected environment (like C or C++).
  3. Use the programs and see which one is more responsive.

Objectivity improvement: get your grandmother to do step 3.

It's all well and good to quote the theoretical performance of optimal GC implementations, but the fact of the matter is that in real-world scenarios programs written in garbage-collected languages do not perform as well as native applications. This is why large projects, where performance translates directly into user experience, still program in C++. The classic example of this is game programming.

Another, perhaps counterintuitive, example of this is the Eclipse IDE. While it may be written in Java, the entire graphical subsystem had to be rewritten to produce acceptable performance. The solution: make GUI elements lightweight wrappers around native (C/C++) components (SWT).

I understand the draw of garbage collected environments. Memory management is hard to get right. And a lot of work. The bottom line though is this: knowing how your program is supposed to behave gives you (the programmer) an edge over a machine trying to guess.

Waylon Flinn
I don't think getting your grandmother to test the responsiveness of execution environments qualifies as "hard data". As for "the fact of the matter is that in real world scenarios programs written in garbage collected languages do not perform as well as native applications" - where is the evidence to support this statement?
harto
@harto It certainly is "hard data". It's "hard data" about perceived performance. If you care more about actual performance get out a stopwatch. The SWT project would not exist if there weren't performance issues with Java, .NET isn't spared this failing. Have you ever played a game written in a language with a GC? If so, can you honestly say it was as responsive as a comparable non-GC game? Everyone likes to toot the GC horn. No one likes to run the GC applications.
Waylon Flinn
-1: "the fact of the matter". How do you explain the fact that Visual Studio is migrating in the opposite direction and getting better results from C# than Eclipse is getting from C++?
Jon Harrop
@Jon: Bad example: VS2010 feels worse and less responsive than 2008.
Zan Lynx
+3  A: 

There are quite a few different arguments given here. I want to start by making clear that you cannot really make a 1:1 comparison. Each approach has its pros and cons, and any given code snippet will be more appropriate for one system or the other. Which is to say, conversely, that you must know whether you have a GC or not in order to write efficient code.

My argument is that you must know your environment and code accordingly. That is what makes your code efficient. Moving from one paradigm to the other while keeping the same coding style will cost you more efficiency than anything the GC itself adds or takes away.

Case:

A program makes thousands of heap memory allocations for short-lived objects. That is, it allocates and deallocates many times, with objects of different sizes.

In a non-GC environment, each allocation ends up calling malloc, and that requires searching the list of free memory fragments for the most suitable one (depending on the specific malloc implementation). The memory is used and then released with free [or new/delete in C++...]. The cost of memory management is the cost of locating the right fragment.

In a GC environment with a moving GC, as in Java or .NET, all free memory is contiguous after each GC run. The cost of acquiring memory for an object is cheap, really cheap (<10 CPU instructions in the Java VM). On each GC run, only the live objects are located and moved to the beginning of the appropriate memory region (usually a different region than the original one). The cost of memory management is primarily the cost of moving all reachable (live) objects. Now, the premise is that most objects are short-lived, so in the end the cost may be smaller than that of a non-GC system. A million objects allocated and then forgotten before a single GC run add essentially no extra cost.

Conclusion: in GC languages you can create all local objects on the heap; they are cheap. In non-GC systems, by contrast, a long series of allocations, deallocations and re-allocations is expensive: the memory gets fragmented and the cost of malloc increases... In non-GC systems you should use the stack as much as possible, and use the heap only out of necessity.
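
A toy illustration of that conclusion (Point is an invented example type):

    struct Point { double x, y; };

    // Transliterated GC-language style: one short-lived heap object per
    // iteration. Cheap under a bump allocator, expensive under new/malloc.
    double sum_heap_style(int n) {
        double sum = 0;
        for (int i = 0; i < n; ++i) {
            Point* p = new Point();
            p->x = i; p->y = i;
            sum += p->x + p->y;
            delete p;
        }
        return sum;
    }

    // Idiomatic C++: the short-lived object lives on the stack, so there
    // is no allocator traffic at all.
    double sum_stack_style(int n) {
        double sum = 0;
        for (int i = 0; i < n; ++i) {
            Point p = { double(i), double(i) };
            sum += p.x + p.y;
        }
        return sum;
    }

The first version is the natural style in a GC language and close to the worst case for malloc/new; the second generates no allocator traffic at all.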

That has another implication. People used to one of the two memory systems will tend to write inefficient programs in the other. They are used to some idioms that are probably bad on the other system.

A clear example is an unmanaged programmer who is used to allocating an object once and reusing it (resetting its internal pointers with new elements as required), with the corresponding mindset: allocation is expensive, reuse is cheap. Now, if the same exact code is moved to a generational GC environment (Java, .NET - both use moving generational GCs), you get a funny effect. In Java's generational GC, the system performs minor collections only on the younger generations, processing the older generations only in full collections. But an object in the young generation can be referred to by objects in the older generation, so extra work has to be performed to keep track of these old-to-young references. In particular, in the Java 1.4.1 garbage collector, the system marks the memory card (a sub-part of a page) where the old object resides and then includes all marked cards for processing during the minor collection, effectively increasing the amount of work the GC has to perform and possibly impacting performance.

The object was alive during 1, 2, 3... GC runs and was moved around that many times, until it was finally promoted to the old generation, where it no longer has to be moved on each GC run and can just sit there... but alas, by resetting its pointers the programmer forces the object to behave like a young one again: its card is marked on every update, and it must be processed during every minor collection, over and over, for as long as it keeps being reused.

To make a sensible comparison, you would need to get programmers who know their environment to write different pieces of code solving the same problem with the same algorithms but with different mindsets about memory management, and then compare their results.

David Rodríguez - dribeas
This is exactly why you'll commonly see memory or object pools being used in non-GC languages. Because the allocation time can be painfully slow.
Jasper Bekkers
I must correct one of your statements. An object in the old generation is never copied back to the young one; rather, it is registered in a "remembered set" of additional roots to scan during minor collections. The main difference is that there is no need to search for pointers to it and update them (which would be enormously slow). Apart from that (and it's not a detail), one might view the object as part of the "semantic" young generation.
Blaisorblade
@Blaisorblade: I had read that objects were moved to the younger generation (maybe in a different GC). I have rechecked the docs and according to the linked document, old-to-young generation references don't add references to the root set, but rather mark parts of the old generation for processing during the minor collection --increasing the amount of work to be performed by the GC.
David Rodríguez - dribeas
I never mentioned the root set - the remembered set is something different. And what you describe is just one possible implementation of a remembered set. I agree that object pools are bad for performance in Java, since they cause slowdowns like you describe now. But the difference is that the amount of extra work is proportional to the number of updated references (with a high constant) and is performed at GC time (so some coalescing will happen), rather than being proportional to the whole heap size (as would be the case with what you wrote in the post) and paid at each "bad" pointer write.
Blaisorblade
@Blaisorblade: Just follow the link and read the docs. After your comment I rechecked the docs and decided to follow them to the letter. The last comment (and the edit to the answer) deal with the concrete implementation of the HotSpot GC in Java 1.4.1, as described in the official document linked. I did not mention the whole heap --in particular, it is the memory cards of the updated object, each smaller than a memory page--, and the document explicitly states that *Each time a pointer field in an object in the heap is modified, the corresponding bit in the card map for that card is set*.
David Rodríguez - dribeas
The current text is fine - it still says 'the object is moved' afterwards, and you might want to fix it, but I already upvoted it :-D My description in terms of a remembered set is more general (in this case, the remembered set is approximated through card marking) and is also described in the document you link (the term 'remembered set' never appears, but the concept is there), so I would use that one, but this one is also OK. In the old version, moving the object and updating pointers to it had O(heap size) complexity to find all pointers - so I wrote 'whole heap'. But it doesn't matter now.
Blaisorblade
A: 

As @dribeas points out, the biggest 'confound' to the experiment in the (Hertz & Berger) paper is that code is always written under some 'implicit assumptions' about what is cheap and what is expensive. Apart from that confound, the experimental methodology (run a Java program offline, create an oracle of object lifetimes, then instrument the 'ideal' alloc/free calls back in) is actually quite brilliant and illuminating. (And my personal opinion is that the confound does not detract too much from their results.)

Personally, my gut feeling is that using a GC'd runtime means accepting a factor-of-three performance hit to your application (GC'd will be 3x slower). But the real landscape of programs is littered with confounds, and you'd be likely to find a huge scatterplot of data if you could perform the 'ideal' experiment on lots of programs across many application domains, with GC sometimes winning and manual management often winning. (And the landscape is continually changing - will the results change once multicore, and software designed for multicore, is mainstream?)

See also my answer to

http://stackoverflow.com/questions/354124/are-there-statistical-studies-that-indicates-that-python-is-more-productive/354249#354249

which has the thesis that "due to so many confounds, all evidence about software engineering is anecdotal".

Brian
A: 

See also

http://prog21.dadgum.com/40.html

which discusses the "sufficiently smart" compiler. The landscape of CS/software is riddled with ideas/techniques which could in theory be more performant than the status quo. But it's all snake oil.

GC is expensive today, and may always be.

Brian
A: 

Side note: another interesting experiment to run, that I haven't seen people try, is to compare with just leaking. Call alloc and never free. It's an interesting alternative.

Except for long-running server apps, you'll never run out of memory in practice; the OS will just start using disk for virtual memory (machines have effectively infinite memory, up to the limits of virtual address space, which I think is huge now with 64-bit machines). This highlights that a GC is nothing more than a device for improving locality. Leaked/dead objects don't "hurt" when you have infinite memory, except that memory comes in hierarchies and you want to keep the 'live' objects nearby and fast and the 'dead' objects on the faraway/slow memory. If each object were allocated on a different page, the OS virtual memory system would effectively be a GC.
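
In C++ the experiment is easy to approximate: replace the global deallocation functions with no-ops and let virtual memory absorb the garbage (a sketch; replacing operator delete like this is legal, but obviously only sensible as an experiment):

    #include <new>

    // Make every `delete` in the program a no-op: nothing is ever freed,
    // and only the OS virtual memory system "collects" cold pages.
    void operator delete(void*) throw() {}
    void operator delete[](void*) throw() {}

    int main() {
        for (long i = 0; i < 10000000L; ++i) {
            int* p = new int(42);
            delete p;   // a no-op now; the block simply leaks
        }
        return 0;
    }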

Brian
+9  A: 

This turns into another flamewar with a lot of "my gut feeling". Some hard data for a change (papers contain details, benchmarks, graphs, etc.):

http://www.cs.umass.edu/~emery/pubs/04-17.pdf says:

"Conclusion. The controversy over garbage collection’s performance impact has long overshadowed the software engineering benefi it provides.This paper introduces a tracing and simulation-based oracular memory manager. Using this framework, we execute a range of unaltered Java benchmarks using both garbage collection and explicit memory management. Comparing runtime, space consumption, and virtual memory footprints, we find that when space is plentiful, the runtime performance of garbage collection can be competitive with explicit memory management, and can even outperform it by up to 4%. We fi that copying garbage collection canrequire six times the physical memory as the Lea or Kingsley allocators to provide comparable performance."

When you have enough memory, copying GC becomes faster than explicit free() - http://www.cs.ucsb.edu/~grze/papers/gc/appel87garbage.pdf

It also depends on the language you use - Java will have to do a lot of rewriting (stack, objects, generations) on each collection, and writing a multithreaded GC for the JVM that doesn't have to stop the world would be a great achievement. On the other hand, you get that almost for free in Haskell, where GC time will rarely be >5% while alloc time is almost 0. It really depends on what you're doing and in what environment.

viraptor
Thanks for a detailed and factual answer!
Carl Seleborg
I think this answer provides enough information to show what needs to be taken into account when deciding on whether to use garbage collection in a project.
Carl Seleborg
@viraptor: I was going to upvote because you cited data but the conclusions you drew are completely unjustified. You cannot assume that the proportion of time spent in Haskell's GC is a measure of its overhead. You must also take into account the performance hit on the mutator of having to work in harmony with the GC and the performance difference due to the change in programming style which is likely to be 1,000% in the case of Haskell due to the relatively poor performance of purely functional data structures.
Jon Harrop
@Carl Seleborg: "provides enough information to show what needs to be taken into account when deciding on whether to use garbage collection in a project". That's insane. You don't decide whether or not to use garbage collection based upon a 4% difference in throughput. There are always *vastly* more important concerns...
Jon Harrop
@Jon Harrop: "due to the relatively poor performance of purely functional data structures" sigh... [citation needed] Even stupid benchmarks like the language shootout on alioth show haskell, lisp and ocaml in range of 3-5 times slower than extremely tweaked C algorithms.
viraptor
@viraptor: http://flyingfrogblog.blogspot.com/2010/07/haskells-hash-tables-revisited.html
Jon Harrop
@Jon Harrop: "The latest Haskell is now back to being only 5.7× slower than F#." - which matches language shootout's results. 1. That's not 1000%. 2. Extrapolating ghc's performance to all "purely functional data structures" is not very nice. In theory there's nothing stopping it from being as fast as another implementation - apart from implementation details. 3. On the same blog there's ocaml hash comparison which is only 3 times slower than f#. 4. Look in the comments where people reduce the runtime 5x, simply by tweaking the heap size.
viraptor
@viraptor: 1. You compared the wrong numbers. Look at: F# hash table 0.75s, Haskell Map 7.5s. 2. Your theory is wrong. You get the same results in OCaml/F#/SML as well, and it is caused by the asymptotically higher cache complexity of purely functional data structures. 3. Irrelevant, hash tables are not purely functional data structures. 4. Also irrelevant, you're talking about disabling GHC's GC in order to work around a bug wrt hash tables, which has nothing to do with this conversation about purely functional data structures.
Jon Harrop
@viraptor: "Even stupid benchmarks like the language shootout on alioth show haskell, lisp and ocaml in range of 3-5 times slower than extremely tweaked C algorithms". That says nothing about purely functional data structures.
Jon Harrop
A: 

GC will always be slower than the extreme alternative: perfect, non-deterministic memory management.

The questions are:

  • Are the differences large enough to quibble about?
  • Are the drawbacks of one technique enough to cause us to seriously consider the other?

There are other areas in which managed subsystems have won out over unmanaged ones:

In general, a program will always run slower on a multitasking operating system than on a uni-tasking one -- or a computer with no OS.

In general, a program will always run slower on a system with virtual memory than on one without.

Except in extreme circumstances, do we seriously consider computer systems without VM and without an OS?

Barry Brown
+2  A: 

Berger's paper is being cited a lot, but it is comparing real garbage collectors against a purely theoretical, offline, optimal algorithm. So while it may tell you something about theoretical limits, it says very little about the performance of real garbage collectors versus real implementations of malloc and free. A study that I like better took real programs and compared explicit malloc and free with Hans Boehm's conservative garbage collector:

The Measured Cost of Conservative Garbage Collection by Ben Zorn

This study isn't perfect, and Zorn is careful to note that if the programs knew they were using a garbage collector, some could be made faster. But the hard data is this: in real programs originally written to use malloc and free, garbage-collected versions run at about the same speed but require twice as much memory. Zorn argues fairly convincingly that if you know you have GC, you can make things faster, but it's hard to eliminate the memory penalty.

I learned more from this careful experimental study than from Berger's study of an unimplementable, idealized memory manager.

Norman Ramsey
As far as I see, they are not proposing the usage of an ideal algorithm. They are instead using such an algorithm to add calls to free() to Java programs. What is missing from the picture is that one might need manual reference counting to perform some of those calls in a real (multithreaded) program, and this would cause a certain overhead. But my gut feeling is that in a C++ program reference-counted smart pointers are used on few objects, so the cost would be limited and their result would still be valid. Edit: I just read the summaries of the paper, not the paper itself.
Blaisorblade
A: 

One pragmatic issue is that with explicit MM it is generally a lot easier to profile, identify the bottleneck, and resolve it.

With a GC system, when your nice O(N) code turns out to thrash the GC in a pathological way that makes it O(heap size), it can be harder to work out what is going wrong. Sometimes it is even as hard as fixing a memory corruption.

soru
Can you elaborate on the situation you describe? I have trouble translating it into something more concrete. I guess thrashing the GC means invoking too many garbage collections by producing a lot of garbage, but I don't really see how: a generational garbage collector with copying for the young generation (i.e. the minimum standard) would still make such a collection O(live data in the young generation). I can see a way to make this fail, but it is not realistic and it is not O(N).
Blaisorblade
@Blaisorblade: You can degrade asymptotic complexity but I cannot see how to get O(heap size). For example, a naive non-tail-recursive `map` function eats *O(n)* stack which, in turn, makes the GC's periodic stack traversal *O(n)*, adding an extra factor of *n* to any algorithm (so `map` becomes *O(n^2)*).
Jon Harrop
@Jon Harrop: I answered in a separate answer. @soru: heap profilers exist for this purpose. Could you comment on why they fail to achieve it?
Blaisorblade
They can help, if used before you have a real problem. But once a program starts running pathologically slowly, profiling it (with any free tool, anyway) could well take longer than the heat death of the universe.
soru
+1  A: 

After discussion on another answer, a disadvantage that turned up is that GC can alter the asymptotic complexity of an algorithm. Soru's comment stated it implicitly, without examples, and one comment is not enough to explain it. Thanks to Jon Harrop for the example he suggested and for useful comments on this answer. However, a good GC should still amortize the costs (given enough memory, as always), as I explain at the end.

With a GC system, when your nice O(N) code turns out to thrash the GC in a pathological way that makes it O(heap size), it can be harder to work out what is going wrong.

First, this happens often when the size of the working set is close to the maximum heap size. The GC is invoked too often and thus everything slows down. Install the Scala plugin for Eclipse and you'll feel it.

Usually, however, having enough memory space and a generational GC prevent this, if your algorithm produces lots of garbage that is quickly detectable as such: it won't survive long enough to get out of the nursery.

Here's an example of an exception to that: let us take "map f list", and assume each application of f consumes live memory by saving a reference to the returned value in a hash table. The asymptotic complexity here should still be O(1) per element, i.e. O(n) for the whole map.

The generational GC won't be able to free memory by collecting the nursery, so a major collection (O(heap content)) is invoked more or less periodically. Hence, we get a runtime at least proportional to (heap content size) * n * (space consumed by each call to f) / (nursery size).

The GC will actually increase the nursery size up to the specified maximum, and then the above happens again for big enough n. However, if the specified maximum is Big-Theta(max heap size), and thus Omega(heap content size), major collections become infrequent, and the cost of minor collections is proportional to the data produced (and thus to the runtime needed to produce them). This is similar to resizing a container by copying: by growing it enough, you can amortize the cost of copying (or of GC) and make it comparable to the cost of the work that caused the copy.

This whole answer does not take into consideration the issue of pause times - collections become infrequent but are long. It also implicitly assumes that stack scanning time is constant - which it should be, unless your algorithm is non-tail-recursive and therefore risks problems anyway on the big inputs considered here.

Finally, none of this is about incremental garbage collectors. They solve the problem at the root, but they are used mostly for real-time GC because of the overhead they add to memory reads. Azul Systems solved this on their own custom HW with their Pauseless GC, thanks to HW support for optimizing this overhead. They also recently claim to have solved the problem for x86 as well; see this "press release" and this article. But it's definitely under development and does not run without changes to the OS. When that's done, and if the performance is as they suggest, maybe the problem will be solved for us normal mortals too.

Blaisorblade
@Blaisorblade: Good answer but I still have three reservations. Firstly, amortizing the cost of deep stacks is hard because it requires stacks to be traversed incrementally. Secondly, Baker's treadmill would seem to be an obvious counter example to the O(heap size) claim. Finally, we're neglecting pause times which are often as important as throughput.
Jon Harrop
Baker's treadmill is an incremental GC, requiring a read barrier and thus slowing down every computation; it is a counterexample, but a theoretical one. The only in-production incremental GC without huge slowdowns that I know of is Azul Systems' Pauseless GC, and they first implemented custom HW for it (much like Lisp machines) before (they claim) implementing it on x86. I had forgotten your note about deep stacks, also because you want to avoid them anyway to avoid stack overflow. O(n) stack depth _is_ a bug by itself, per your definitions elsewhere, isn't it? Point taken on pause times.
Blaisorblade
You have to keep in mind that malloc-type MMs can still have pathological O(head-size) behavior, both in time and space (e.g. with fragmentation). BTW, the Lisp machines (like Symbolics) had custom hardware to help with things like GC, but by dedicating transistors to memory management instead of computation, it just made computation slower, so there was no real advantage to it.
Gabe
O(head-size) -> O(heap-size), right? That'd be an interesting point; I've never seen data about that (except for fragmentation, which is famous). About Lisp machines, I think Azul has just fast paths in HW (in this case, mostly read barriers), and GC-ing 600 GB/s (as they claim) seems a pretty good result. I tend to trust Cliff Click's reputation and extend it to Azul; that's why I trust this otherwise crazy data. http://www.stanford.edu/class/ee380/Abstracts/070221.html. They never write their single-core raw performance, but IIRC they aimed for many not-so-fast cores for scalability.
Blaisorblade