tags:
views: 926
answers: 13

In Z80 machine code, there is a cheap technique for initializing a buffer to a fixed value, say all blanks. A chunk of code might look something like this:

LD HL, DESTINATION             ; point to the source
LD DE, DESTINATION + 1         ; point to the destination
LD BC, DESTINATION_SIZE - 1    ; copying this many bytes
LD (HL), 0X20                  ; put a seed space in the first position
LDIR                           ; move 1 to 2, 2 to 3...

The result is that the chunk of memory at DESTINATION is completely blank-filled. I have experimented with memmove and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.
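
For concreteness, a direct transcription of the same idea looks something like this (the buffer name and size are just for the example); it does not reproduce the fill:

#include <cstring>

const std::size_t DESTINATION_SIZE = 256;   // hypothetical size
char destination[DESTINATION_SIZE];

void blank_fill_attempt()
{
    destination[0] = ' ';   // put a seed space in the first position
    // Unlike LDIR, memmove does NOT smear the seed byte through the buffer:
    // it is specified to copy as if through a temporary buffer, so only
    // destination[1] receives the space. With memcpy the overlapping ranges
    // make the behavior undefined.
    std::memmove(destination + 1, destination, DESTINATION_SIZE - 1);
}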

Why do memmove and memcpy behave this way?

Is there any reasonable way to do this sort of array initialization?

I am already aware of char array[size] = {0} for array initialization.

I am already aware that memset will do the job for single characters.

What other approaches are there to this issue?

+3  A: 

Why do memmove and memcpy behave this way?

Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)

The language doesn't specify how any given piece of hardware implements anything. That is entirely up to the writers of the compiler and libraries. Of course, writing your own, highly specialized version for every imaginable hardware configuration is a lot of work. That’ll be the reason.

Is there any reasonable way to do this sort of array initialization?

Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.
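
For instance, a minimal std::fill version of the blank fill (the buffer name and size are arbitrary) would look like this:

#include <algorithm>

char buffer[256];

void blank_fill()
{
    // Fill the whole buffer with spaces; a good implementation (or the
    // optimizer) will typically turn this into a block fill.
    std::fill(buffer, buffer + sizeof buffer, ' ');
}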

Konrad Rudolph
I'm not looking for a Z80 compiler. I'm looking for an LDIR-like method to initialize buffers.
EvilTeach
I didn't answer since I don't understand the issue. The way to initialise buffers in C++ is with std::fill (or memset, or wmemset, or non-portable equivalents for larger values). Why don't you like that? What's the motivation for the requirement "must be LDIR-like", are you just fond of the idiom?
Steve Jessop
+1  A: 

There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.
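
A minimal sketch (the element count is arbitrary):

#include <cstdlib>

void example()
{
    // calloc returns memory that is already zero-filled.
    char *buffer = static_cast<char *>(std::calloc(256, sizeof(char)));
    if (buffer != 0)
    {
        // ... use the zeroed buffer ...
        std::free(buffer);
    }
}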

codelogic
+8  A: 

memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy on the Z80 to be able to fill memory this way, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around. It's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap. It's useful for it to get the right answer for moving memory blocks.

If you want to fill memory, use memset, which is designed to do just what you want.
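
For example (the buffer name and size are arbitrary):

#include <cstring>

char destination[256];

void blank_fill()
{
    // Set every byte of the buffer to a space character.
    std::memset(destination, ' ', sizeof destination);
}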

Ned Batchelder
+3  A: 

If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.

Greg Hewgill
+6  A: 

I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those words, I believe that was a guiding principle informing his design of C (and the design of C by subsequent people) as well. Now, you may think that if you allocate memory it should automatically be initialized to zeros, and I'd tend to agree with you. But that takes machine cycles, and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically, C and C++ try to stay out of your way--hence if you want something initialized you have to do it yourself.

Onorio Catenacci
+1  A: 

memcpy() should have that behavior. memmove() doesn't by design, if the blocks of memory overlap, it copies the contents starting at the ends of the buffers to avoid that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block fill instruction (such as REP STOSB on x86 architectures).

Ferruccio
Why should memcpy have that behaviour? On most hardware, I would be deeply disappointed with a memcpy so unoptimised that it in effect picks up and puts down one byte at a time, which is what this LDIR use relies on but which the C standard functions do not offer.
Steve Jessop
+4  A: 

The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.

Memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its behavior as specified by the C and C++ standards. Memcpy's behavior on overlapping ranges is undefined; it might work identically to memmove, or it might be different, depending on how your compiler decides to implement it. The compiler is free to choose a method that is more efficient than memmove.
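
For example, a small sketch of the overlapping move that memmove guarantees to get right (the contents are arbitrary):

#include <cstring>

void shift_up_one()
{
    char buf[8] = { '1', '2', '3', '4', '5', '6', '7', '8' };
    // The source and destination ranges overlap, but memmove behaves as if
    // the source were first copied to a temporary buffer, so buf ends up as
    // '1','1','2','3','4','5','6','7' - the block moves up intact instead of
    // being smeared the way the Z80 LDIR trick smears its seed byte.
    std::memmove(buf + 1, buf, 7);
}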

Mark Ransom
+1  A: 

This can be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.

mov esi, source    ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0  ; initialize the first byte with the "seed"
mov ecx, 100h      ; set ecx to the size of the buffer
rep movsb          ; do the fill

However, it is simply more efficient to set more than one byte at a time if you can.
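
A rough sketch of that idea (the function name is made up; memcpy is used for each store, so alignment isn't a concern):

#include <cstring>
#include <cstddef>

typedef unsigned int word_t;   // assumed to be 4 bytes wide for this sketch

void fill_by_words(unsigned char *dst, unsigned char value, std::size_t count)
{
    word_t pattern = 0x01010101u * value;   // replicate the byte into all 4 bytes
    while (count >= 4)
    {
        std::memcpy(dst, &pattern, 4);      // one 4-byte store per iteration
        dst += 4;
        count -= 4;
    }
    while (count--)
    {
        *dst++ = value;                     // finish the tail a byte at a time
    }
}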

Finally, memcpy/memmove aren't what you are looking for: those are for making copies of blocks of memory from one area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.

Evan Teran
+1  A: 

If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() might be implemented as you describe on a compiler that targets Z80s.

memcpy() might also use a similar sequence on that compiler.

But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?

Remember that the x86 architecture has a similar set of instructions that could be prefixed with a REP opcode to have them execute repeatedly to do things like copy, fill or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486) the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.

Michael Burr
+5  A: 

There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways like full screen number countdowns including the border) came up with this technique which is 4 times faster:

  • saves the Stack Pointer and then moves it to the end of the screen.
  • LOADs the HL register pair with zero,
  • goes into a massive loop PUSHing HL onto the Stack.
  • The Stack moves up the screen and down through memory and in the process, clears the screen.

The explanation above was taken from the review of David Webb's game Starion.

The Z80 routine might look a little like this:

  LD DE, SP       
  LD HL, 0
  LD BC, 0x1800   ; Size of screen
  LD SP, 0x4000   ; Start of screen
LOOP:
  PUSH HL
  DEC BC
  JNZ LOOP
  LD SP, DE
devstopfix
It's been mumble years since I did anything with a Z80, but that looks good to me. I'd add 'LD SP, DE' at the end, of course.
David Thornley
Interesting technique. Thank you.
EvilTeach
An even faster way I have used is to put several "PUSH HL" instructions inside the loop. So, if you were clearing, say, 2K of memory, you might use 16 "PUSH HL"s and only loop around 2K/16 (256) times.
Mike Thompson
DEC does not set zero flag. Actually it doesn't set any flag.
ssg
16-bit DEC doesn't set any flags, but 8-bit DEC does. Rewriting the loop as an inner loop over C and an outer loop over B would take care of that problem, as would use of DJNZ, which is IIRC faster than DEC B; JNZ LOOP separately. Of course, that would require the inner loop to be over B...
RBerteig
A: 

Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler do the work for you. As an example, here's some code VS2005 generated for this exact case (using a templated size):

template <int S>
class A
{
  char s_[S];
public:
  A()
  {
    for(int i = 0; i < S; ++i)
    {
      s_[i] = 'A';
    }
  }
  int MaxLength() const
  {
    return S;
  }
};

extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all

void test()
{
  A<5> a5;
  useA(a5, a5.MaxLength());
}

The assembler output is the following:

test PROC

[snip]

; 25   :    A<5> a5;

mov eax, 41414141H    ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al

; 26   :    useA(a5, a5.MaxLength());

lea eax, DWORD PTR a5[esp+40]
push    5    ; MaxLength()
push    eax
call    useA

It does not get any more efficient than that. Stop worrying and trust your compiler, or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison, I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop, and the compiler produced identical output.

Andreas Magnusson
If that output was from objdump, you should pass the -C option, it'll decode C++ names :)
Evan Teran
Thanks, but the output was directly from the compiler, I could of course tidy it up a bit...
Andreas Magnusson
Your example is not a good one, because the compiler detects that the array has only 5 bytes and so does a 4-byte and a 1-byte store operation from eax. This would look quite different with a significantly bigger array size.
karx11erx
Of course, but the whole point was to illustrate that the fancy-pancy optimization techniques used in the olden days of writing Z80 asm are no longer necessary. Using a larger value for S would yield a call to memset(), which would most likely do rep stosd (+ alignment).
Andreas Magnusson
A: 

As said before, memset() offers the desired functionality.

memcpy() is for copying blocks of memory between buffers that must not overlap; if the ranges overlap, its behavior is undefined.

memmove() additionally handles the case of overlapping buffers, whether dest lies above or below source.

On x86 architectures, good compilers directly replace memset calls with inline assembly instructions very effectively setting the destination buffer's memory, even applying further optimizations like using 4-byte values to fill as long as possible (if the following code isn't totally syntactically correct blame it on my not using X86 assembly code for a long time):

lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:

Actually this code is far more efficient than your Z80 version, as it doesn't do memory to memory, but only register to memory moves. Your Z80 code is in fact quite a hack as it relies on each copy operation having filled the source of the subsequent copy.

If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see Andreas Magnusson's post), but I doubt that this actually happens for nested loops, let alone ones that invoke initialization functions.

karx11erx
+1  A: 

If you're on the PowerPC, _dcbz().

Crashworks