views:

40

answers:

1

I've read about the different kinds of GC and how they work. All involve traversing the set of memory which might potentially be reclaimed, but nothing I've read gives any indication as to how this is actually done. Would something like the following be way off the mark?

void* myAlloc(size_t size)
{
    if (needToGc())
        gc();
    void* obj = malloc(size);
    if (!obj)
        outOfMemory(); // or maybe GC and try once more
    *alloced = obj;
    *alloced = *alloced + 1;
    return obj;
}
void gc()
{
    // go through memory pointed to in `alloced`
}

I suspect so, but it's the only thing that's come to mind...

+1  A: 

There is a lot of variety in garbage collectors, depending on requirements (memory overhead, latency, data layout, …). Some of my answer may not apply to some sophisticated collectors.

First, there is no need for needToGc: the garbage collector is triggerred if malloc fails.

As to how memory is traversed, the garbage collector's task is to separate memory that's in use from memory that's not in use, and reclaim the latter. The basic principle is that memory is in use if it is reachable from the program. This is determined as follows:

  • Some root objects are deemed reachable. For example, anything on the stack is reachable, and any global variable is reachable.

  • If a reachable object points to another object, then that other object is also reachable.

  • Anything left over is unreachable, hence deemed not in use and reclaimable.

When the garbage collector is triggered, it goes over all the root objects, and for each of them recursively traverses objects that are reachable. Once the collector has traversed all reachable objects, it reclaims the objects that were not traversed.

A simple technique for this traversal is called mark-and-sweep. Every object contains a mark bit which is initially 0. When the traversal function reaches an object, it looks at the mark bit: if the mark bit is 1, the object has already been traversed; if the mark bit is 0, the traversal function sets it to 1 and calls itself recursively on every object to which the current object points to. The mark phase consists of calling the traversal function for every root object. It is followed by the sweep phase, which iterates over all allocated objects: if an object is still marked 0, it is freed; otherwise its mark is reset to 0.

Most garbage collectors out there are based on mark-and-sweep, though usually with considerable added sophistication.

See the wikipedia article for more information and other pointers.

Gilles
This describes the basic operation of all tracing garbage collectors. However, there is another large class of garbage collectors that work quite differently: they are called *reference counting* garbage collectors and how they work is pretty much already described in their name (much like with the tracing collectors, actually). They count incoming references to objects, and when that count reaches 0, the object in question is eligible for garbage collection.
Jörg W Mittag
@Jörg: This is getting dangerously close to a religious war, but many people (me included) don't consider reference counting to be a garbage collector (it may be part of one). Reference counting doesn't collect cyclic garbage (i.e., A points to B points to C points to ... points to A, and none of them are reachable from the roots). This makes reference counting useless for functional programming, for example (it won't collect closures of recursive functions).
Gilles
Sorry... I know the basic principles of GC. The thing I'm hung up on is -- `malloc` is putting memory all over the place. How does the GC know where to find it? That's why I included the snippet.. `myAlloc` saves the addresses of everything it allocates, so that GC knows what to traverse. But it seems to me that what I wrote is fairly inefficient (that array will get very large. deleting from the start/middle of it will suck, etc.) -- so how's it actually done?
Pete
@Peter: the gc doesn't need to know where `malloc` is putting memory. For a gc built on top of `malloc`, `myAlloc` may need to keep track of the allocated block because the gc needs to pass them to `free` when they become unreachable. But it can use a better data structure than an array. If the gc knows how malloc is implemented, it can directly modify the heap to mark unreachable blocks as available, and then `myAlloc` doesn't need to keep track of anything. The gc knows what to traverse from the roots.
Gilles
Ah, I see. Well, since I don't really want to delve into the implementation of `malloc`, would a list be an appropriate data structure for `myAlloc` to keep track of the pointers? The reasoning being that when the GC needs to free one it can just free it, delete the node and stitch the links, rather than shifting the whole array.
Pete
@Peter: Are you writing a gc? If so, worry about getting it correct first, it's not so easy! Also, read some of [Hans Boehm](http://www.hpl.hp.com/personal/Hans_Boehm/gc/index.html)'s papers, he wrote the de facto standard conservative gc for C and C++.
Gilles
@Peter: For a gc on top of `malloc`, the structure in `myAlloc` needs to be a *set* of pointers to allocated block. A hash table or a trie-based data structures would be a reasonable compromise between speed, memory overhead and simplicity. You could use a similar table to keep track of marked blocks during collection.
Gilles