762 views, 5 answers
I've written several copy functions in search of a good memory strategy on PowerPC. Using the Altivec or fp registers with cache hints (dcb*) doubles the performance over a simple byte copy loop for large data. Initially pleased with that, I threw in a regular memcpy to see how it compared... 10x faster than my best! I have no intention of rewriting memcpy, but I do hope to learn from it and accelerate several simple image filters that spend most of their time moving pixels to and from memory.

Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration, memcpy's performance advantage is still embarrassing. I'm using dcbz to free up bandwidth; Apple's code uses nothing, but both versions tend to hesitate on stores.

prefetch
  dcbt future
  dcbt distant future
load stuff
  lvx image
  lvx image + 16
  lvx image + 32
  lvx image + 48
  image += 64
prepare to store
  dcbz filtered
  dcbz filtered + 32
store stuff
  stvxl filtered
  stvxl filtered + 16
  stvxl filtered + 32
  stvxl filtered + 48
  filtered += 64
repeat
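For reference, here is the same loop structure in portable C. This is only a sketch: the real PPC version uses lvx/stvxl for the 16-byte vector moves and dcbt/dcbz for the cache hints, none of which exist in portable C, so plain 64-byte block moves stand in for them here.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Portable sketch of the 64-bytes-per-iteration copy loop above.
 * On PPC the "load stuff" step would be 4 lvx instructions, the
 * "store stuff" step 4 stvxl instructions, preceded by dcbt
 * prefetches and dcbz on the destination lines. */
static void copy64(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i;
    for (i = 0; i + 64 <= n; i += 64) {
        uint8_t tmp[64];
        memcpy(tmp, src + i, 64);     /* "load stuff": 4 vector reads   */
        memcpy(dst + i, tmp, 64);     /* "store stuff": 4 vector writes */
    }
    for (; i < n; i++)                /* tail bytes, if n % 64 != 0 */
        dst[i] = src[i];
}
```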

Does anyone have some ideas on why very similar code has such a dramatic performance gap? I'd love to marinate the real image filters in whatever secret sauce memcpy is using!

Additional info: All data is vector aligned. I'm making filtered copies of the image, not replacing the original. The code runs on PowerPC G4, G5, and Cell PPU. The Cell SPU version is already insanely fast.

A: 

Maybe it's because of CPU caching. Try running Cachegrind:

Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20--100x slower than normal.
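A typical invocation looks like this (./your-filter is a placeholder for the actual binary):

```shell
# Run the target under Cachegrind; results are written to cachegrind.out.<pid>
valgrind --tool=cachegrind ./your-filter
# Summarize cache misses per function and per source line
cg_annotate cachegrind.out.*
```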

Andreas Bonini
Cachegrind absolutely does not work on PPC/Darwin.
Nick Bastin
@Nick, are you sure? http://en.wikipedia.org/wiki/Valgrind "As of version 3.4.0, Valgrind supports Linux on x86, x86-64 and PowerPC"
Andreas Bonini
@Andreas: It works on *Linux*, but definitely not Darwin. The only Darwin target supported (and barely) is x86.
Nick Bastin
Also, I'd imagine it's a pretty low priority, since Shark on PPC should give you the same insight.
Nick Bastin
I'd be lost without Shark, but it doesn't help so much with detailed cache data. It shows the ripples on the surface (this instruction stubbed its toe), but not the monster beneath (on what?).
Invisible Cow
+2  A: 

I don't know exactly what you're doing, since I can't see your code, but Apple's secret sauce is here.

Nick Bastin
I could see the disassembly in Shark, so I know what they're doing in the copy loop. I'm just wondering what's there before that loop that seems to kick it into overdrive. That code should help, so thanks for the link!
Invisible Cow
@Invisible Cow: Yeah, I was just hoping that would provide a bit more context (and comments) that might be insightful.
Nick Bastin
Added some code to the question, for the G4 and its 32-byte cachelines.
Invisible Cow
Actually, that's an in-kernel bcopy. The user mode one the OP is seeing is probably the one at http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/commpage/bcopy_970.s
ohmantics
+4  A: 

Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration

I may be stating the obvious, but since you don't mention the following at all in your question, it may be worth pointing it out:

I would bet that Apple's choice of 4 vector reads followed by 4 vector writes has as much to do with the G5's pipeline and its management of out-of-order instruction execution in "dispatch groups" as with a magical 64-byte perfect line size. Did you notice the line skips in Nick Bastin's linked bcopy.s? They mean the developer thought about how the instruction stream would be consumed by the G5. If you want to reproduce the same performance, it's not enough to read data 64 bytes at a time; you must also make sure your instruction groups are well filled (basically, as I remember it, instructions are grouped up to five at a time, with the first four being non-jump instructions and only the fifth allowed to be a jump. The details are more complicated).

EDIT: you may also be interested by the following paragraph on the same page:

The dcbz instruction still zeros aligned 32 byte segments of memory as per the G4 and G3. However, since that is not a full cacheline on a G5 it will not have the performance benefits that you were likely hoping for. There is a dcbzl instruction newly introduced for the G5 that zeros a full 128-byte cacheline.

Pascal Cuoq
I had not thought of dispatch groups. The whole "instruction soup" of the G5 has always perplexed me, and I much prefer working with the Cell, simply because its execution model fits in my head. As for the edit, the code already differs for the larger cachelines.
Invisible Cow
A: 

Still not an answer, but did you verify that memcpy is actually moving the data? Maybe it was just remapped copy-on-write. You would still see the inner memcpy loop in Shark, since the first and last pages are truly copied.

Potatoswatter
A: 

As mentioned in another answer, "dcbz", as defined by Apple on the G5, only zeroes aligned 32-byte segments, so it will cost you performance on a G5, which has 128-byte cachelines. You need to use "dcbzl" to prevent the destination cacheline from being fetched from memory (which would otherwise effectively halve your useful read memory bandwidth).
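The store side of the loop would then issue exactly one dcbzl per 128-byte line before filling it. Here is a portable sketch of that structure — zero_cacheline() is a stand-in for a single dcbzl (which would really be an inline-asm wrapper), and 128 is the G5 line size:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define LINE 128  /* G5 cacheline size; G4/G3 lines are 32 bytes */

/* Stand-in for one dcbzl: establish the whole aligned destination
 * line as zeros so the store unit never fetches it from memory. */
static void zero_cacheline(uint8_t *p)
{
    memset(p, 0, LINE);
}

/* One zero_cacheline() per 128-byte line, then fill the line with
 * stores -- not one dcbz per 32 bytes as in the original loop.
 * On PPC the memcpy below would be 8 x 16-byte stvxl stores. */
static void copy_lines(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i + LINE <= n; i += LINE) {
        zero_cacheline(dst + i);          /* single dcbzl per line */
        memcpy(dst + i, src + i, LINE);   /* fill the zeroed line  */
    }
}
```

(The zero-then-copy sequence is redundant in plain C; the point is the per-line stride of the cache-zero instruction.)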

JanePhanie
And don't forget - you should only use 1 "dcbzl" per 128 byte line. It appears that your code is doing a "dcbz" every 32 bytes.
JanePhanie