views:

1121

answers:

8

I was told by a rather smart person that you cannot implement garbage collection in C because of it's weakly typed. The basic idea seems to be that C gives you too much freedom. He mentioned casting pointers without type checking...

I don't really grok the idea. Can someone give me an explanation and possibly a code sample of why this wouldn't work.

NOTE: Obviously C is about speed and why would you want to add garbage collection? I'm just curious really.

+12  A: 

He probably referred to the fact that you can cast a pointer to an int and back to the original pointer type. It's pretty much impossible for a GC to clean up correctly when you do that, consider:

char * p = (char *) malloc(16);
int i = (int) p;
p = 0;
// GC runs and finds that the memory is no longer referenced
p = (char *) i;
// p is now a dangling pointer

EDIT: The above will only produce a dangling pointer with a precise GC. As others have pointed out, a conservative collector can still correctly handle this scenario as it assumes that any bit pattern that could be a valid pointer actually is a pointer and will thus not free the memory allocated. However, this is of course no longer possible when i is further modified such that it no longer looks like a valid pointer to the collector, e.g. as follows:

char * p = (char *) malloc(16);
int i = ~((int) p);
p = 0;
// GC runs and finds that the memory is no longer referenced
p = (char *) ~i;
// p is now a dangling pointer

Moreover, (again as others have pointed out) it's only impossible to implement a GC for C if you want to retain the full functionality of the language. If you refrain from using tricks like the above (i.e. you confine yourself to a subset of the possible operations) then GC is indeed feasible.

Andreas Huber
Huh? - When did "new" become part of C?
Tall Jeff
I dig the point though, just replace new with malloc
Tim Merrifield
Nice concise explanation.
Stephen Martin
@Andreas: That's pretty much the answer, but since the question is about C, not C++, you should edit to use malloc (and perhaps a better suitable name than "MyClass") instead of new.
P Daddy
A conservative garbage collector would see that the variable "i" contains a value that points to the start of a valid object, and decide that the object should not be collected. This is how early Java GCs worked.
kdgregory
Brian Rasmussen makes a good point re pointer arithmetic. If you use "i = p++", then the GC would have to be extremely conservative, and assume that any value that points into the heap potentially could be used to free an object.
kdgregory
Andreas Huber
@kdgregory: True, a conservative GC would spot this simple usage, will edit.
Andreas Huber
This is a nice, concise answer to the types of problems that a C garbage collector will run into. But you can have a conservative GC - so it's not necessarily "impossible" to have a C garbage collector, but it is tricky. See http://www.hpl.hp.com/personal/Hans_Boehm/gc/gcdescr.html for some details.
Michael Burr
Minor quibble: 'int i' won't always be large enough to hold a pointer and also possibly creates signed arithmetic problems. uintptr_t is the C99 defined integral type that is guaranteed to be large enough to hold a pointer.
Hudson
@Hudson: You are right, but uintptr_t might not be understood by many readers.
Andreas Huber
Another nice, easy to understand example for how a conservative GC can be fooled. Wish I could +2.
Michael Burr
A: 

In order to garbage collect an object, you need to know that zero variables reference that object. If no one's "interested" in the object, it can be garbage collected.

In C, an array or struct keeps no information to backtrack to all pointer variables that reference it. It doesn't know who is "interested." One possibility would be to search all the process' memory for things that may be pointers to the object, and count those as references.

But how do you know that a given sequence of four bytes is a pointer referencing the object? It could just be coincidence. It could be an integer has happens to have the same value as the memory location of the object. It could be that the four bytes aren't a distinct variable at all, they're just four bytes in the middle of a larger variable like a struct or a char buffer.

You can even turn any random four bytes into a pointer at runtime. For example, the following code prints an integer that is stored at a memory location pointed to by some bytes in the middle of a char buffer.

int w = 0, x = 1, y = 2, z = 3;
char buf[16];
int **p = (int **) buf;
*p++ = &w;
*p++ = &x;
*p++ = &y;
*p++ = &z;
p = (int **) buf;
p++;
printf("%d\n", **p);

The example above prints the value of x. I think that's what is meant by weakly typed.

Bill Karwin
+2  A: 

It is not impossible due to weak typing. Actually C is more a strongly typed language than a weakly typed language. More specifically C it is a statically typed language, but that has little to do with the problem of garbage collection.

The problem is that there's no way for the runtime to know for certain if any piece of memory is referenced or not. Even if you wrap all memory allocation in code that registers the usage, you can still obtain pointers to used memory via regular pointer manipulation (or by mistake). Casts only make the problem harder for the runtime. So if the runtime frees a piece of memory, it will mess things up for any pointers still pointing to that area of memory. Obviously the situation only gets worse when you consider that garbage collection has to work for multi-threaded applications as well.

Brian Rasmussen
Static/dynamic typing, and strong/weak typing, are different things. It is possible for a to be a weak statically-typed language (C), or a strong dynamically-typed language (Python), for example.
mipadi
I know - but I guess that point didn't come through. Strongly/weakly typed are not very well defined whereas statically/dynamically are).
Brian Rasmussen
"Casts only make the problem harder for the runtime." Casting without type checking is not strong typing.
Tim Merrifield
@Tim: Unfortunately there doesn't seem to be exact definitions on strong/weak. So its more levels of gray than real black/white. I would agree that a language with RTTI can be viewed as stronger than one without, but I would say that statically typed languages are strongly typed to some extend.
Brian Rasmussen
@Brian: I agree, there is a lot of confusion around the term strong typing. Though because of C's lack of runtime type checking, I would say it's more weakly typed than its ancestors. I probably should have just left the "weakly typed" verbiage out of the question.
Tim Merrifield
When down voting, please leave a comment. Thanks.
Brian Rasmussen
+9  A: 

It's perfectly possible to implement whatever memory manager you can think of in C. The catch is that you then have to use its allocation/deallocation functions exclusively and restrict your 'pointer magic' to things it can keep track of. Aditionally, the memory management might be restricted to certain supported types.

For example, Objective-C's retain/release system and autorelease pools are basically memory managers implemented in C. Many libraries also implement their own, simple form of memory management like reference counting.

Then, there's the Boehm garbage collector. To use it, just replace your malloc()/realloc() calls with the Boehm versions and you never have to call free() again. Read about the possible issues with this approach.

Also, check this wikipedia page for a quick overview on how conservative garbage collectors work.

Christoph
Other catches are that you will almost certainly want to make all global roots available (e.g. by maintaining a shadow stack) and for multi-threaded programs you'll probably also want to globally change the calling convention to make thread-local data efficiently accessible.
Jon Harrop
+3  A: 

It's impossible to implement a precise garbage collector for C because of the freedoms afforded C's pointers and the fact that the length of a C array is anyone's guess. This means a lot of sophisticated garbage collection approaches can't be used. (Copying and compacting garbage collectors come to mind.)

It is, however, possible to implement a conservative garbage collector (boehm), which basically assumes everything that looks like it might be a pointer is a pointer. This isn't very efficient, but it works for a suitably lenient definition of "works".

bendin
+1  A: 

It's not impossible to implement a garbage collector for C (and in fact, they do exist, as a simple google search reveals), it's just difficult, because it can be difficult to determine if a certain string of bits is a pointer into an allocated block or just looks like one.

The reason this is an issue is because C (and C++, for that matter) allows you to cast from a pointer type to an integral type, so an integer variable might hold an address within an allocated block, preventing the GC from freeing that block, even though that value wasn't intended to be a pointer.

For example, let's say I have a block of memory allocated. Let's suppose this block of memory is allocated starting at address 0x00100000 (1,048,576), and is 1 MB long, so extends to 0x001FFFFF (2,097,151).

Let's say I also am storing the size of an image file in a variable (let's call it fileSize). This image file happens to be 1.5 MB (1,572,864 bytes).

So when the garbage collector runs, it will come across my fileSize variable, find it containing a value that corresponds to an address within my allocated block, and decide that it cannot free this block, lest it invalidate my maybe pointer. That's because the GC doesn't know if I've done this:

int fileSize;
{
    char *mem = (char*)malloc(1048576);
    fileSize = (int)(mem + 524288);
}
// say GC runs here

or if I've just done this:

int fileSize;
{
    char *mem = (char*)malloc(1048576);
    fileSize = 1572864;
}
// say GC runs here;

In the latter case, it is safe to free the block at *mem, (if no other references exist), whereas in the former, it's not. It must be conservative and assume that it's not, so the memory "leaks" (at least until fileSize goes out of scope or is changed to a value outside the allocated block).

But garbage collectors for C (and C++) do exist. Whether or not they are valuable is a matter for a different discussion.

P Daddy
+1  A: 

C is not weakly typed, but this code illustrates the difficulty in building a garbage collector into the language:

#include <stdio.h>
#include <stdlib.h>

int GetSomeMemory() {
    char* pointerToHeapMemory = malloc(10);
    return (int)pointerToHeapMemory;
}

int main() {
    int memoryAddress = GetSomeMemory();

    /* at this point a garbage collector might decide to clear up the memory that
     * was allocated in GetSomeMemory on the grounds that pointerToHeapMemory 
     * is no longer in scope. But the truth is we still know about that memory and 
     * we're about to use it again... */

    char* anotherPointerToHeapMemory = (char*) memoryAddress;

    sprintf(anotherPointerToHeapMemory, "123456789\0");
    printf("%s\n", anotherPointerToHeapMemory);
}

Garbage collection can be done so long as everyone working on a project agrees to avoid this kind of thing and use a common set of functions for allocating and accessing memory. For example, this is a C garbage collector implementation

d4nt
+3  A: 

If you read the right papers and you have a bachelor's degree in CS, it's actually pretty easy to implement a decent conservative garbage collector for C---I have a dozen students who have done it as a class exercise taking about four weeks. Then spend 20 years improving it and you get the Boehm collector (libgc).

The basic idea is simple: if there is a bit pattern anywhere in a register, on the stack, in a global variable, or in a live heap object, and that bit pattern happens to be an address that falls inside an object allocated with malloc, than that object is considered live. Any object that is not live cannot possibly be reached by following pointers, and so it can be reclaimed and used to satisfy future allocation requests. This technique operates on the hardware representation of pointers, and it is completely independent of the type of pointer---types are irrelevant here.

It is true there is a caveat: conservative garbage-collection techniques can be fooled by willfully hiding pointers. Compress pointer-containing structures, keep the only copy of a pointer on disk, obfuscate a pointer by XORing 0xdeadbeef, and all these techniques will break a conservative collector. But this kind of problem is extremely rare unless done deliberately. Authors of optimizing compilers are usually careful not to hide pointers from such a collector.

The most interesting part of your question is why do it. Three reasons:

  • It eliminates the possibility of many memory-manangement bugs.

  • It simplifies your APIs because it is no longer necessary to specify who allocates memory, who owns the allocated memory, whether it's necessary to copy memory, and who is responsible for freeing memory.

  • Believe it or not, it can be faster than using malloc and free.

Norman Ramsey
@Norman: "extremely rare". The authors of the infamous book Numerical Recipes actually did a pointer hack in order to use Fortran-like 1-indexed arrays from C that breaks conservative collectors. However, that is actually illegal according to the C specification...
Jon Harrop