Because of Generational Collection, I'd have to say that tracing and copying are not huge bottlenecks to GC.
What would help, is hardware-assisted READ barriers which take away the need for 'stop the world' pauses when doing stack scans and marking the heap.
Azul Systems has done this: http://www.azulsystems.com/products/compute_appliance.htm
They gave a presentation at JavaOne on how their hardware modifications allowed for completely pauseless GC.
Another improvement would be hardware assisted write barriers for keeping track of remembered sets.
Generational GCs, and even more so for G1 or Garbage First, reduce the amount of heap they have to scan by only scanning a partition, and keeping a list of remembered sets for cross-partition pointers.
The problem is this means ANY time the mutator (fancy word for the 'real program') sets a pointer it also has to put an entry in the appropriate rememered set. So you have (small) overhead even when you're not GCing. If you can reduce this, you'd reduce both the pause times neccessary for GCing, and overall program performance.