From what I've seen in the past, StackOverflow seems to like programming challenges, such as the fast char-to-string exercise, which got dozens of responses. This is an optimization challenge: take a very simple function and see if you can come up with a smarter way of doing it.

I've had a function that I've wanted to further optimize for quite some time, but I always find that my optimizations have some hole that results in incorrect output--some rare special case in which they fail. Still, given the function, I've always figured one should be able to do better than this.

The function takes an input datastream (effectively random bits, from an entropy perspective) and wraps it into a NAL unit. This involves placing escape codes: any byte sequence of 00 00 00, 00 00 01, 00 00 02, or 00 00 03 gets replaced with 00 00 03 XX, where XX is the last byte of the original sequence. Given the odds against such a sequence, an escape code only gets placed about once every 4 million bytes of input--so this is a challenge where one searches an enormous amount of data while doing almost nothing to it, except in very rare cases. However, because "doing something" involves inserting bytes, it makes things a bit trickier. The current unoptimized code is the following C:
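For example, the six input bytes 00 00 01 00 00 00 come out as 00 00 03 01 00 00 03 00: each time two zeros are followed by a byte of 0x03 or less, a 0x03 is inserted before that byte.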

src and dst are pointers to arrays of bytes, and end is the pointer to the end of the input data.

int i_count = 0;  /* number of consecutive zero bytes seen so far */
while( src < end )
{
    /* two zeros followed by a byte <= 0x03: insert the 0x03 escape */
    if( i_count == 2 && *src <= 0x03 )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( *src == 0 )
        i_count++;
    else
        i_count = 0;
    *dst++ = *src++;
}

Common input sizes to this function range from roughly 1,000 to 1,000,000 bytes of data.
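For the timing numbers quoted in this thread, a simple wall-clock harness along these lines is enough (a hypothetical sketch; the 100-passes-over-1,000,000-random-bytes setup matches the measurements quoted in the comments below):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main( void )
{
    enum { N = 1000000, PASSES = 100 };
    uint8_t *in  = malloc( N );
    uint8_t *out = malloc( N + N/3 + 4 );   /* worst case: every 00 00 0x grows by one byte */
    int i, pass;
    clock_t t0;

    srand( 1 );                             /* fixed seed, so runs are comparable */
    for( i = 0; i < N; i++ )
        in[i] = rand() & 0xFF;

    t0 = clock();
    for( pass = 0; pass < PASSES; pass++ )
    {
        const uint8_t *src = in, *end = in + N;
        uint8_t *dst = out;
        int i_count = 0;
        while( src < end )                  /* the loop under test */
        {
            if( i_count == 2 && *src <= 0x03 )
            {
                *dst++ = 0x03;
                i_count = 0;
            }
            if( *src == 0 )
                i_count++;
            else
                i_count = 0;
            *dst++ = *src++;
        }
    }
    printf( "%.3f seconds\n", (double)( clock() - t0 ) / CLOCKS_PER_SEC );
    free( in );
    free( out );
    return 0;
}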

My initial idea is a function that (somehow) quickly searches the input for places where an escape code is needed, so that the more complex logic can be skipped for the vast majority of the data, where no escape codes need to be placed.
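A rough, untested sketch of that idea (the function name is mine): memchr jumps from zero byte to zero byte, and everything in between is block-copied:

#include <stdint.h>
#include <string.h>

static uint8_t *escape_nal_scan( uint8_t *dst, const uint8_t *src,
                                 const uint8_t *end )
{
    while( src < end - 2 )
    {
        /* find the next zero byte that could start a 00 00 0x triple */
        const uint8_t *p = memchr( src, 0, end - src - 2 );
        if( !p )
            break;
        if( p[1] == 0 && p[2] <= 3 )
        {
            /* copy through the two zeros, then insert the escape byte;
               p[2] stays in the input, since a 0 there can start
               another escape sequence */
            memcpy( dst, src, p - src + 2 );
            dst += p - src + 2;
            *dst++ = 3;
            src = p + 2;
        }
        else
        {
            /* lone zero: copy through it and keep scanning */
            memcpy( dst, src, p - src + 1 );
            dst += p - src + 1;
            src = p + 1;
        }
    }
    memcpy( dst, src, end - src );
    return dst + ( end - src );
}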


+1  A: 

Applying the obvious optimisations to your code:

#define unlikely(x) __builtin_expect((x),0)

int i_count = 0;
while( src < end )
{
    const unsigned char s = *src++;  /* unsigned: s <= 0x03 must not match 0x80..0xFF */
    if( unlikely(i_count==2 && s<=0x03) )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( unlikely(s==0) )
        i_count++;
    else
        i_count = 0;
    *dst++ = s;
}
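
Also, if the aliasing between the two byte pointers is what's holding the compiler back, C99's restrict (spelled __restrict__ in older GCC) is the standard way to promise that they never overlap. A sketch of what the signature might look like, untested:

#include <stdint.h>

/* hypothetical prototype: restrict tells the compiler that nothing
   written through dst is ever read through src */
uint8_t *escape_nal( uint8_t *restrict dst,
                     const uint8_t *restrict src,
                     const uint8_t *end );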
This is actually a very good point--because they are byte pointers, the C compiler has to assume they can alias each other, since char* is exempt from the strict-aliasing rules. I'll have to try this and see if it helps.
Dark Shikari
And, unfortunately, your change actually reduces performance by 10% or so (!?). I'd have to look at the disassembly to see exactly why, of course. I'm going to guess it's because the first if statement short-circuits, so "s" almost never needs to be loaded until the last line of the loop.
Dark Shikari
btw, is the 'else' I inserted correct?
And on that note, is there any way to make the C compiler know that src and dst cannot alias each other, even though they're char* pointers?
Dark Shikari
You're right--the else will change the output of the function; it's not correct. Unsigned vs. signed on the char doesn't matter.
Dark Shikari
I think __restrict__ does this in GCC. Also, is the performance drop because 's' is (incorrectly) signed?
No. I changed it to uint8_t when I pasted it into my program. When I fixed your extra else, the performance got even worse--more than a 35% slowdown versus the original.
Dark Shikari
Just tried __restrict__; no performance benefit, so either aliasing isn't a problem, or gcc just doesn't take advantage of it.
Dark Shikari
With the "unlikely", your attempt comes back to exactly match the performance of the original code. Interesting... by the way, thanks for the unlikely macro, I may try this elsewhere in the codebase.
Dark Shikari
Lol...ok, I give up.
A: 

Mike F, thanks a lot for the "unlikely" suggestion: the following is about 10% faster than the original:

#define unlikely(x) __builtin_expect((x),0)
int i_count = 0;
while( src < end )
{
    if( unlikely(i_count == 2) && unlikely(*src <= 0x03) )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( unlikely(*src == 0) )
        i_count++;
    else
        i_count = 0;
    *dst++ = *src++;
}

And, even better, the following is 50% faster than the original (!!!). I still can't figure out why the likely() in the loop condition helps at all... it must be gcc being weird again.

#define unlikely(x) __builtin_expect((x),0)
#define likely(x) __builtin_expect((x),1)
int i_count = 0;
while( likely(src < end) )
{
    if( unlikely(i_count == 2) && unlikely(*src <= 0x03) )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( unlikely(*src == 0) )
        i_count++;
    else
        i_count = 0;
    *dst++ = *src++;
}

However, I was hoping for more than just optimizing the current naive approach; I suspect there must be a better way to do it than merely handling every byte individually.
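One direction along those lines (an untested sketch; the function name is mine): use the classic word-at-a-time zero-byte test to bulk-copy 4-byte chunks that can't possibly need an escape, falling back to the bytewise loop only around zero bytes:

#include <stdint.h>
#include <string.h>

/* nonzero iff some byte of x is zero: the usual SWAR trick */
static int has_zero_byte( uint32_t x )
{
    return ( ( x - 0x01010101u ) & ~x & 0x80808080u ) != 0;
}

static uint8_t *escape_nal_swar( uint8_t *dst, const uint8_t *src,
                                 const uint8_t *end )
{
    int i_count = 0;
    while( src < end )
    {
        /* fast path: when the previous byte was nonzero (i_count == 0),
           a word containing no zero byte can never need an escape */
        while( i_count == 0 && src + 4 <= end )
        {
            uint32_t w;
            memcpy( &w, src, 4 );        /* memcpy sidesteps alignment traps */
            if( has_zero_byte( w ) )
                break;
            memcpy( dst, &w, 4 );
            src += 4;
            dst += 4;
        }
        if( src >= end )
            break;
        /* slow path: one step of the original bytewise loop */
        if( i_count == 2 && *src <= 0x03 )
        {
            *dst++ = 0x03;
            i_count = 0;
        }
        if( *src == 0 )
            i_count++;
        else
            i_count = 0;
        *dst++ = *src++;
    }
    return dst;
}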

Dark Shikari
A: 
while (src < end-2)
{
 if ((src[0] == 0) && (src[1] == 0) && (src[2] <= 3))
 {
  dst[0] = 0;
  dst[1] = 0;
  dst[2] = 3;
  dst[3] = src[2];
  src += 3;
  dst += 4;
  if (src[-1] == 0)
  {
   /* an escaped 0 can begin another 00 00 0x run, so re-examine it */
   src--;
   dst--;
  }
 }
 else
  *dst++ = *src++;
}
while (src < end)
 *dst++ = *src++;
Mark Ransom
Are you sure you don't mean dst[3] = src[2];?
Dark Shikari
Thanks, I'll correct it. Stupid typo.
Mark Ransom
+2  A: 

Hmm...how about something like this?

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

while( likely(src < end) )
{
    //Copy non-zero run. NB: if the final run contains no 0 byte,
    //strlen will read past end, so the buffer must be zero-terminated
    //(or padded) for this to be safe.
    size_t runlen = strlen( (const char *)src );
    if( unlikely(src+runlen >= end) )
    {
        memcpy( dest, src, end-src );
        dest += end-src;
        src = end;
        break;
    }

    memcpy( dest, src, runlen );
    src += runlen;
    dest += runlen;

    //Deal with 0 byte: bounds check first, so src[1] and src[2]
    //are never read past end
    if( unlikely(src<=end-3 && src[1]==0 && src[2]<=3) )
    {
        //00 00 0x: emit the two zeros plus the escape, but leave the
        //0x byte in the input, since a 0 there can begin another
        //escape sequence
        *dest++ = 0;
        *dest++ = 0;
        *dest++ = 3;
        src += 2;
    }
    else
    {
        *dest++ = 0;
        src++;
    }
}

There's some duplication of effort between strlen and memcpy that it'd be nice to get rid of, though.
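If the local libc has a fast memccpy (it's POSIX, not ISO C90), the scan and the copy could in principle be merged into a single pass; a rough, untested sketch:

#include <string.h>
#include <stdint.h>

while( src < end )
{
    //memccpy copies up to and including the first 0 byte, returning
    //a pointer just past the copied 0 in dest (NULL if no 0 found)
    uint8_t *p = memccpy( dest, src, 0, end - src );
    if( !p )
    {
        dest += end - src;  //no zero left: the tail was copied whole
        src = end;
        break;
    }
    src += p - dest;        //run plus its terminating 0 consumed
    dest = p;
    //the 0 just copied may start a 00 00 0x triple
    if( src <= end-2 && src[0] == 0 && src[1] <= 3 )
    {
        *dest++ = 0;        //second 0 of the pair
        *dest++ = 3;        //escape; src[1] is re-examined next pass
        src++;
    }
}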

Impressive, over 40% faster than both mine above and Mark Ransom's. I can't guarantee it gives correct results, as I'd need to test on an extremely large dataset to make sure, but an initial test shows it's probably right.
Dark Shikari
Ok, final edit and now I'm *really* done. Fun question BTW...
No, you're not done; src<=end-3 needs to be the first check in the if, since you run the risk of overflowing the end of the array if you don't. Also, unlikely/likely are totally unnecessary here because almost all time is spent in memcpy/strlen, so there's no measurable benefit to them.
Dark Shikari
Yeah, Mark's approach of handling the last 3 bytes outside the main loop is probably better.
Yeah, it'd be nice to merge strlen and memcpy. memccpy is perfect for this, but as a fellow developer of mine said, you can't rely on libc implementations of really obscure functions to be fast (and, unsurprisingly, it's slower than your method).
Dark Shikari
Aye, strlcpy would do the job too, same objection (and I don't think it's even in glibc).
A: 

Faster version of my previous answer:

while (src < end-2)
{
 if (src[0] == 0)
 {
  if (src[1] == 0)
  {
   if (src[2] <= 3)
   {
    dst[0] = 0;
    dst[1] = 0;
    dst[2] = 3;
    dst[3] = src[2];
    src += 3;
    dst += 4;
    if (src[-1] == 0)
    {
     /* an escaped 0 can begin another 00 00 0x run, so re-examine it */
     src--;
     dst--;
    }
   }
   else
   {
    dst[0] = 0;
    dst[1] = 0;
    dst[2] = src[2];
    src += 3;
    dst += 3;
   }
  }
  else
  {
   dst[0] = 0;
   dst[1] = src[1];
   src += 2;
   dst += 2;
  }
 }
 else
  *dst++ = *src++;
}
while (src < end)
 *dst++ = *src++;
Mark Ransom
Not faster here, probably just because the initial condition of the if only triggers in 1 of every 256 cases, so the benefit from the extra branching is pretty much zero.
Dark Shikari
Running this with MS VC6, 100 passes over the same 1,000,000 random bytes takes 0.287 seconds. My original takes 0.345 seconds. Yours takes 0.442 seconds.
Mark Ransom
I should mention that I have an AMD x64 processor; that probably makes a difference too.
Mark Ransom
The MSVC++ compiler has wildly different performance characteristics from GCC too, which is probably the main cause of the differences.
A: 

A test like this is insanely dependent on your compiler and processor/memory setup. On my system, my improved version is the exact same speed as Mike F's strlen/memcpy version.

Mark Ransom
Ah snap, just commented the same thing.
I'd say it's much more dependent on your libc, in the case of the strlen/memcpy version. A good libc will give far better results than any of the naive byte-based versions posted here, but a bad libc will probably be worse.
Dark Shikari
A: 

It seems to me that as long as you're copying byte-by-byte you'll be falling foul of word alignment issues when accessing memory, and these will be slowing you down.

So, I'd suggest the following (sorry for the pseudocode, my C/C++ is largely forgotten):

  • Search to find the next insertion point
    [The Boyer-Moore search algorithm would be a good bet, as it's sublinear (it doesn't need to examine every byte); see the sketch after this list]
  • Block copy the unchanged block
    [My understanding is that GCC and other good C++ compilers can turn the memcpy() call directly into the correct processor instruction, giving near optimal performance]
  • Insert the altered code
  • Repeat until done
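
A rough, untested C sketch of the search step (the names are mine; the driver around it would memcpy up to the returned position, emit 00 00 03, and resume two bytes past it so the escaped byte is re-examined). The sublinear behaviour comes from the fact that any pair of consecutive zeros must cover one odd offset, so only every other byte needs to be tested:

#include <stdint.h>
#include <stddef.h>

/* Find the start of the next 00 00 0x triple, or NULL if none.
   Only odd offsets are tested for zero, since any pair of
   consecutive zeros must include one of them. */
static const uint8_t *find_escape( const uint8_t *src, const uint8_t *end )
{
    const uint8_t *p;
    for( p = src + 1; p + 1 < end; p += 2 )
    {
        if( *p )                                    /* no 00 00 pair includes p */
            continue;
        if( p[-1] == 0 && p[1] <= 3 )
            return p - 1;                           /* triple starts at p-1 */
        if( p[1] == 0 && p + 2 < end && p[2] <= 3 )
            return p;                               /* triple starts at p */
    }
    return NULL;
}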
Bevan