Hardware Assisted Garbage Collection

views:

488

answers:

+9 Q:

Hardware Assisted Garbage Collection

I was thinking on ways that functional languages could be more tied directly to their hardware and was wondering about any hardware implementations of garbage collection.

This would speed things up significantly as the hardware itself would implicitly handle all collection, rather than the runtime of some environment.

Is this what LISP Machines did? Has there been any further research into this idea? Is this too domain specific?

Thoughts? Objections? Discuss.

Why would it "speed things up"? What exactly should the hardware be doing? It still has to traverse all references between objects, which means it has to run through a big chunk of data in main memory, which is the main reason why it takes time. What would you gain by this? Which particular operation is it that you think could be done faster with hardware support? Most of the work in a garbage collector is just following pointers/references, and occasionally copying live objects from one heap to another. how is this different from the instructions a CPU already supports?

With that said, why do you think we need faster garbage collection? A modern GC can already be very fast and efficient, to the point where it's basically a solved problem.

jalf 2009-02-12 16:58:09

what 'tude? I simply asked for more details on what you thought it'd achieve (and why it would be beneficial). Sorry if it sounded harsh. I simply don't see what it is you want to hardware-accelerate exactly.

jalf 2009-02-12 19:44:41

But when you say that "this would speed things up significantly", I think it's fair game to ask *why*. Don't you?

jalf 2009-02-12 19:46:57

I was under the impression when things are usually implemented in a hardware version, it tends to outpace the software version. Not in all cases, as I have seen documentation of virtualized functions running faster than their hardware counterparts. You just came off as snobby dude.

Nicholas Mancuso 2009-02-13 00:32:45

Just tell me why you think its an going to be slower. I'll stop because I dont want to cause a flame war or something. But Other than _speed_ perhaps give some reasons why it shouldn't be done. Because it seems there are gobs of papers that have done prototypes of these.

Nicholas Mancuso 2009-02-13 00:33:44

IBM wouldn't be pouring millions of dollars into Metronome, Sun wouldn't be scrapping the ConcLowPause collector for Garbage-First if it was a 'solved problem'.Sure, its solved on commodity hardware, but you talk about 64 gig heaps, mutlicore, and reducing pause times, we have a long ways to go.

runT1ME 2009-02-13 19:57:22

+2 A:

I'm pretty sure that some prototypes should exist. But develop, and produce hardware specific features is very expensive. It took very long time to implement MMU or TLB at a hardware level, which are quite easy to implement.

GC is too big to be efficiently implemented into hardware level.

Jérôme 2009-02-12 16:59:28

Things like OpenGL and virtualization are arguably too big to be implemented in hardware, too, but you can add *some* hardware support which makes it possible for these to go really fast. He did say "hardware *assisted*", not "hardware-only".

Ken 2009-02-12 21:01:58

I'd suggest that if one takes a soft core and butchers the MMU hardware assisted GC would be possible

Tim Williscroft 2009-03-13 00:41:05

The argument that you can't implement the exact same algorithm is absurd. It is the same as saying "you can't implement a machine which can implement that algorthm". In this case such a machine already exist because a cpu is such a machine, it's just not optimized for that particular task. A counter is one of the simplist constructs that can be implement in logic. The rest is a combination of cashing, virtual memory, and possibly DMA.

NoMoreZealots 2009-08-15 03:06:50

+2 A:

One obvious solution was to have memory pointers which are larger than your available RAM, for example, 34bit pointers on a 32 bit machine. Or use the uppermost 8 bits of a 32bit machine when you have only 16MB of RAM (2^24). The Oberon machines at the ETH Zurich used such a scheme with a lot success until RAM became too cheap. That was around 1994, so the idea is quite old.

This gives you a couple of bits where you can store object state (like "this is a new object" and "I just touched this object"). When doing the GC, prefer objects with "this is new" and avoid "just touched".

This might actually see a renaissance because no one has 2^64 bytes of RAM (= 2^67 bits; there are about 10^80 ~ 2^240 atoms in the universe, so it might not be possible to have that much RAM ever). This means you could use a couple of bits in todays machines if the VM can tell the OS how to map the memory.

Aaron Digulla 2009-02-12 16:59:53

I seem to remember the original Apple Macs used to do something similar: 32bit ptrs, but top 8 bits were some sort of tag and only 24bits address. Some things were allowed to be relocated to avoid fragmentation. They had to get rid of it by time >16MByte machines appeared!

timday 2009-02-12 21:11:39

Actually, there are about 10^80 atoms in the universe, according to Wikipedia.

Matt Olenik 2009-03-13 01:22:16

I've updated my cache (in my head).

Aaron Digulla 2009-03-13 08:19:08

+1 A:

Probably the most relevant piece of data needed here is, how much time (percentage of CPU time) is presently being spent on garbage collection? It may be a non-problem.

If we do go after this, we have to consider that the hardware is fooling with memory "behind our back". I guess this would be "another thread", in modern parlance. Some memory might be unavailable if it were being examined (maybe solvable with dual-port memory), and certainly if the HWGC is going to move stuff around, then it would have to lock out the other processes so it didn't interfere with them. And do it in a way that fits into the architecture and language(s) in use. So, yeah, domain specific.

Look what just appeared... in another post... Looking at java's GC log.

gbarry 2009-02-12 17:02:07

+2 A:

There was an article on lambda the ultimate describing how you need a GC-aware virtual memory manager to have a really efficient GC, and VM mapping is done mostly by hardware these days. Here you are :)

Nemanja Trifunovic 2009-02-12 17:06:41

+2 A:

Yes. Look at the related work sections of these 2 papers:

https://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/index.htm http://www.filpizlo.com/papers/pizlo-ismm2007-stopless.pdf

ja 2009-02-12 20:50:01

I gather the biggest problem is CPU registers and the stack. One of the things you have to do during GC is traverse all the pointers in your system, which means knowing what those pointers are. If one of those pointers is currently in a CPU register then you have to traverse that as well. Similarly if you have a pointer on the stack. So every stack frame has to have some sort of map saying what is a pointer and what isn't, and before you do any GC traversing you have to get any pointers out into memory.

You also run into problems with closures and continuations, because suddenly your stack stops being a simple LIFO structure.

The obvious way is to never hold pointers on the CPU stack or in registers. Instead you have each stack frame as an object pointing to its predecessor. But that kills performance.

Paul Johnson 2009-02-12 22:26:12

Well, it goes with out saying GC would have to integrated into the processor and work in combination with the MMU and cache system.

NoMoreZealots 2009-08-15 03:30:56

+4 A:

Because of Generational Collection, I'd have to say that tracing and copying are not huge bottlenecks to GC.

What would help, is hardware-assisted READ barriers which take away the need for 'stop the world' pauses when doing stack scans and marking the heap.

Azul Systems has done this: http://www.azulsystems.com/products/compute_appliance.htm They gave a presentation at JavaOne on how their hardware modifications allowed for completely pauseless GC.

Another improvement would be hardware assisted write barriers for keeping track of remembered sets.

Generational GCs, and even more so for G1 or Garbage First, reduce the amount of heap they have to scan by only scanning a partition, and keeping a list of remembered sets for cross-partition pointers.

The problem is this means ANY time the mutator (fancy word for the 'real program') sets a pointer it also has to put an entry in the appropriate rememered set. So you have (small) overhead even when you're not GCing. If you can reduce this, you'd reduce both the pause times neccessary for GCing, and overall program performance.

runT1ME 2009-02-13 20:05:45

azulsystems look interesting

Ian Ringrose 2009-03-13 09:05:01

I was wrong, Garbage First scans/marks the whole heap during a GC but only copies (or doesn't copy) sections at a time.

runT1ME 2009-03-16 23:48:34

+2 A:

Older sparc systems had tagged memory ( 33 bits) which was usable to mark addresses. Not fitted today ?

This came from their LISP heritage IIRC.

One of my friends built a generational GC that tagged by stealing a bit from primitives. It worked better.

So, yes it can be done, but nodody bothers tagging things anymore.

runT1mes' comments about hardware assisted generational GC are worth reading.

given a sufficiently big gate array (vertex-III springs to mind) an enhanced MMU that supported GC activities would be possible.

Tim Williscroft 2009-03-13 00:39:59

+1 A:

You're a grad student, sounds like a good topic for a research grant to me. Look into FPGA design and computer architechture there are plenty of free processor design availiable on http://www.opencores.org/

Garbage collection can be implemented as a background task, it's already intended for parallel operation.

Pete

NoMoreZealots 2009-08-15 03:27:02

ansaurus

tags:

views:

answers:

Hardware Assisted Garbage Collection

related questions