From what I've seen in the past, StackOverflow seems to like programming challenges, such as the fast char-to-string exercise, which got dozens of responses. This is an optimization challenge: take a very simple function and see if you can come up with a smarter way of doing it.

I've had a function that I've wanted to further optimize for quite some time, but I always find that my optimizations have some hole that results in incorrect output--some rare special case in which they fail. Still, given the function, I've always figured one should be able to do better than this.

The function takes an input datastream (effectively random bits, from an entropy perspective) and wraps it into a NAL unit. This involves placing escape codes: any byte sequence of 00 00 00, 00 00 01, 00 00 02, or 00 00 03 gets replaced with 00 00 03 XX, where XX is the last byte of the original sequence. Given the odds against such a sequence, an escape code only gets placed about once every 4 million bytes of input--so this is a challenge where one searches an enormous amount of data while doing almost nothing to it, except in very rare cases. However, because "doing something" involves inserting bytes, it makes things a bit trickier. The current unoptimized code is the following C:
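For example, the six input bytes 00 00 01 00 00 00 come out as 00 00 03 01 00 00 03 00: each time two zeros are followed by a byte of 0x03 or less, a 0x03 is inserted before that byte.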

src and dst are pointers to arrays of bytes, and end is the pointer to the end of the input data.

int i_count = 0;  /* number of consecutive zero bytes seen so far */
while( src < end )
{
    /* two zeros followed by a byte <= 0x03: insert the 0x03 escape */
    if( i_count == 2 && *src <= 0x03 )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( *src == 0 )
        i_count++;
    else
        i_count = 0;
    *dst++ = *src++;
}

Common input sizes to this function range from roughly 1,000 to 1,000,000 bytes of data.
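For the timing numbers quoted in this thread, a simple wall-clock harness along these lines is enough (a hypothetical sketch; the 100-passes-over-1,000,000-random-bytes setup matches the measurements quoted in the comments below):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main( void )
{
    enum { N = 1000000, PASSES = 100 };
    uint8_t *in  = malloc( N );
    uint8_t *out = malloc( N + N/3 + 4 );   /* worst case: every 00 00 0x grows by one byte */
    int i, pass;
    clock_t t0;

    srand( 1 );                             /* fixed seed, so runs are comparable */
    for( i = 0; i < N; i++ )
        in[i] = rand() & 0xFF;

    t0 = clock();
    for( pass = 0; pass < PASSES; pass++ )
    {
        const uint8_t *src = in, *end = in + N;
        uint8_t *dst = out;
        int i_count = 0;
        while( src < end )                  /* the loop under test */
        {
            if( i_count == 2 && *src <= 0x03 )
            {
                *dst++ = 0x03;
                i_count = 0;
            }
            if( *src == 0 )
                i_count++;
            else
                i_count = 0;
            *dst++ = *src++;
        }
    }
    printf( "%.3f seconds\n", (double)( clock() - t0 ) / CLOCKS_PER_SEC );
    free( in );
    free( out );
    return 0;
}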

My initial idea is a function that (somehow) quickly searches the input for places where an escape code is needed, so that the more complex logic can be skipped for the vast majority of the data, where no escape codes need to be placed.
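A rough, untested sketch of that idea (the function name is mine): memchr jumps from zero byte to zero byte, and everything in between is block-copied:

#include <stdint.h>
#include <string.h>

static uint8_t *escape_nal_scan( uint8_t *dst, const uint8_t *src,
                                 const uint8_t *end )
{
    while( src < end - 2 )
    {
        /* find the next zero byte that could start a 00 00 0x triple */
        const uint8_t *p = memchr( src, 0, end - src - 2 );
        if( !p )
            break;
        if( p[1] == 0 && p[2] <= 3 )
        {
            /* copy through the two zeros, then insert the escape byte;
               p[2] stays in the input, since a 0 there can start
               another escape sequence */
            memcpy( dst, src, p - src + 2 );
            dst += p - src + 2;
            *dst++ = 3;
            src = p + 2;
        }
        else
        {
            /* lone zero: copy through it and keep scanning */
            memcpy( dst, src, p - src + 1 );
            dst += p - src + 1;
            src = p + 1;
        }
    }
    memcpy( dst, src, end - src );
    return dst + ( end - src );
}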


+1  A: 

Applying the obvious optimisations to your code:

#define unlikely(x) __builtin_expect((x),0)

int i_count = 0;
while( src < end )
{
    const unsigned char s = *src++;  /* unsigned: s <= 0x03 must not match 0x80..0xFF */
    if( unlikely(i_count==2 && s<=0x03) )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( unlikely(s==0) )
        i_count++;
    else
        i_count = 0;
    *dst++ = s;
}
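
Also, if the aliasing between the two byte pointers is what's holding the compiler back, C99's restrict (spelled __restrict__ in older GCC) is the standard way to promise that they never overlap. A sketch of what the signature might look like, untested:

#include <stdint.h>

/* hypothetical prototype: restrict tells the compiler that nothing
   written through dst is ever read through src */
uint8_t *escape_nal( uint8_t *restrict dst,
                     const uint8_t *restrict src,
                     const uint8_t *end );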
This is actually a very good point--because they are byte pointers, the C compiler has to assume they can alias each other, since char* is exempt from the strict-aliasing rules. I'll have to try this and see if it helps.
Dark Shikari
And, unfortunately, your change actually reduces performance by 10% or so (!?). I'd have to look at the disassembly to see exactly why, of course. I'm going to guess it's because the first if statement short-circuits, so "s" almost never needs to be loaded until the last line of the loop.
Dark Shikari
btw, is the 'else' I inserted correct?
And on that note, is there any way to make the C compiler know that src and dst cannot alias each other, even though they're char* pointers?
Dark Shikari
You're right--the else will change the output of the function; it's not correct. Unsigned vs. signed on the char doesn't matter.
Dark Shikari
I think __restrict__ does this in GCC. Also, is the performance drop because 's' is (incorrectly) signed?
No. I changed it to uint8_t when I pasted it into my program. When I fixed your extra else, the performance got even worse--more than a 35% slowdown versus the original.
Dark Shikari
Just tried __restrict__; no performance benefit, so either aliasing isn't a problem, or gcc just doesn't take advantage of it.
Dark Shikari
With the "unlikely", your attempt comes back to exactly match the performance of the original code. Interesting... by the way, thanks for the unlikely macro, I may try this elsewhere in the codebase.
Dark Shikari
Lol...ok, I give up.
A: 

Mike F, thanks a lot for the "unlikely" suggestion: the following is about 10% faster than the original:

#define unlikely(x) __builtin_expect((x),0)
int i_count = 0;
while( src < end )
{
    if( unlikely(i_count == 2) && unlikely(*src <= 0x03) )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( unlikely(*src == 0) )
        i_count++;
    else
        i_count = 0;
    *dst++ = *src++;
}

And, even better, the following is 50% faster than the original (!!!). I still can't figure out why the likely() in the loop condition helps at all... it must be gcc being weird again.

#define unlikely(x) __builtin_expect((x),0)
#define likely(x) __builtin_expect((x),1)
int i_count = 0;
while( likely(src < end) )
{
    if( unlikely(i_count == 2) && unlikely(*src <= 0x03) )
    {
        *dst++ = 0x03;
        i_count = 0;
    }
    if( unlikely(*src == 0) )
        i_count++;
    else
        i_count = 0;
    *dst++ = *src++;
}

However, I was hoping for more than just optimizing the current naive approach; I suspect there must be a better way to do it than merely handling every byte individually.
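One direction along those lines (an untested sketch; the function name is mine): use the classic word-at-a-time zero-byte test to bulk-copy 4-byte chunks that can't possibly need an escape, falling back to the bytewise loop only around zero bytes:

#include <stdint.h>
#include <string.h>

/* nonzero iff some byte of x is zero: the usual SWAR trick */
static int has_zero_byte( uint32_t x )
{
    return ( ( x - 0x01010101u ) & ~x & 0x80808080u ) != 0;
}

static uint8_t *escape_nal_swar( uint8_t *dst, const uint8_t *src,
                                 const uint8_t *end )
{
    int i_count = 0;
    while( src < end )
    {
        /* fast path: when the previous byte was nonzero (i_count == 0),
           a word containing no zero byte can never need an escape */
        while( i_count == 0 && src + 4 <= end )
        {
            uint32_t w;
            memcpy( &w, src, 4 );        /* memcpy sidesteps alignment traps */
            if( has_zero_byte( w ) )
                break;
            memcpy( dst, &w, 4 );
            src += 4;
            dst += 4;
        }
        if( src >= end )
            break;
        /* slow path: one step of the original bytewise loop */
        if( i_count == 2 && *src <= 0x03 )
        {
            *dst++ = 0x03;
            i_count = 0;
        }
        if( *src == 0 )
            i_count++;
        else
            i_count = 0;
        *dst++ = *src++;
    }
    return dst;
}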

Dark Shikari
A: 
while (src < end-2)
{
 if ((src[0] == 0) && (src[1] == 0) && (src[2] <= 3))
 {
  dst[0] = 0;
  dst[1] = 0;
  dst[2] = 3;
  dst[3] = src[2];
  src += 3;
  dst += 4;
  if (src[-1] == 0)
  {
   /* an escaped 0 can begin another 00 00 0x run, so re-examine it */
   src--;
   dst--;
  }
 }
 else
  *dst++ = *src++;
}
while (src < end)
 *dst++ = *src++;
Mark Ransom
Are you sure you don't mean dst[3] = src[2];?
Dark Shikari
Thanks, I'll correct it. Stupid typo.
Mark Ransom
+2  A: 

Hmm...how about something like this?

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

while( likely(src < end) )
{
    //Copy non-zero run. NB: if the final run contains no 0 byte,
    //strlen will read past end, so the buffer must be zero-terminated
    //(or padded) for this to be safe.
    size_t runlen = strlen( (const char *)src );
    if( unlikely(src+runlen >= end) )
    {
        memcpy( dest, src, end-src );
        dest += end-src;
        src = end;
        break;
    }

    memcpy( dest, src, runlen );
    src += runlen;
    dest += runlen;

    //Deal with 0 byte: bounds check first, so src[1] and src[2]
    //are never read past end
    if( unlikely(src<=end-3 && src[1]==0 && src[2]<=3) )
    {
        //00 00 0x: emit the two zeros plus the escape, but leave the
        //0x byte in the input, since a 0 there can begin another
        //escape sequence
        *dest++ = 0;
        *dest++ = 0;
        *dest++ = 3;
        src += 2;
    }
    else
    {
        *dest++ = 0;
        src++;
    }
}

There's some duplication of effort between strlen and memcpy that it'd be nice to get rid of, though.
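If the local libc has a fast memccpy (it's POSIX, not ISO C90), the scan and the copy could in principle be merged into a single pass; a rough, untested sketch:

#include <string.h>
#include <stdint.h>

while( src < end )
{
    //memccpy copies up to and including the first 0 byte, returning
    //a pointer just past the copied 0 in dest (NULL if no 0 found)
    uint8_t *p = memccpy( dest, src, 0, end - src );
    if( !p )
    {
        dest += end - src;  //no zero left: the tail was copied whole
        src = end;
        break;
    }
    src += p - dest;        //run plus its terminating 0 consumed
    dest = p;
    //the 0 just copied may start a 00 00 0x triple
    if( src <= end-2 && src[0] == 0 && src[1] <= 3 )
    {
        *dest++ = 0;        //second 0 of the pair
        *dest++ = 3;        //escape; src[1] is re-examined next pass
        src++;
    }
}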

Impressive, over 40% faster than both mine above and Mark Ransom's. I can't guarantee it gives correct results, as I'd need to test on an extremely large dataset to make sure, but an initial test shows it's probably right.
Dark Shikari
Ok, final edit and now I'm *really* done. Fun question BTW...
No, you're not done; src<=end-3 needs to be the first check in the if, since you run the risk of overflowing the end of the array if you don't. Also, unlikely/likely are totally unnecessary here because almost all time is spent in memcpy/strlen, so there's no measurable benefit to them.
Dark Shikari
Yeah, Mark's approach of handling the last 3 bytes outside the main loop is probably better.
Yeah, it'd be nice to merge strlen and memcpy. memccpy is perfect for this, but as a fellow developer of mine said, you can't rely on libc implementations of really obscure functions to be fast (and, unsurprisingly, it's slower than your method).
Dark Shikari
Aye, strlcpy would do the job too, same objection (and I don't think it's even in glibc).
A: 

Faster version of my previous answer:

while (src < end-2)
{
 if (src[0] == 0)
 {
  if (src[1] == 0)
  {
   if (src[2] <= 3)
   {
    dst[0] = 0;
    dst[1] = 0;
    dst[2] = 3;
    dst[3] = src[2];
    src += 3;
    dst += 4;
    if (src[-1] == 0)
    {
     /* an escaped 0 can begin another 00 00 0x run, so re-examine it */
     src--;
     dst--;
    }
   }
   else
   {
    dst[0] = 0;
    dst[1] = 0;
    dst[2] = src[2];
    src += 3;
    dst += 3;
   }
  }
  else
  {
   dst[0] = 0;
   dst[1] = src[1];
   src += 2;
   dst += 2;
  }
 }
 else
  *dst++ = *src++;
}
while (src < end)
 *dst++ = *src++;
Mark Ransom
Not faster here, probably just because the initial condition of the if only triggers in 1 of every 256 cases, so the benefit from the extra branching is pretty much zero.
Dark Shikari
Running this with MS VC6, 100 passes over the same 1,000,000 random bytes takes 0.287 seconds. My original takes 0.345 seconds. Yours takes 0.442 seconds.
Mark Ransom
I should mention that I have an AMD x64 processor; that probably makes a difference too.
Mark Ransom
The MSVC++ compiler has wildly different performance characteristics from GCC too, which is probably the main cause of the differences.
A: 

A test like this is insanely dependent on your compiler and processor/memory setup. On my system, my improved version is the exact same speed as Mike F's strlen/memcpy version.

Mark Ransom
Ah snap, just commented the same thing.
I'd say it's much more dependent on your libc, in the case of the strlen/memcpy version. A good libc will give far better results than any of the naive byte-based versions posted here, but a bad libc will probably be worse.
Dark Shikari
A: 

It seems to me that as long as you're copying byte-by-byte you'll be falling foul of word alignment issues when accessing memory, and these will be slowing you down.

So, I'd suggest the following (sorry for the pseudocode, my C/C++ is largely forgotten):

  • Search to find the next insertion point
    [The Boyer-Moore search algorithm would be a good bet, as it's sublinear (it doesn't need to examine every byte); see the sketch after this list]
  • Block copy the unchanged block
    [My understanding is that GCC and other good C++ compilers can turn the memcpy() call directly into the correct processor instruction, giving near optimal performance]
  • Insert the altered code
  • Repeat until done
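
A rough, untested C sketch of the search step (the names are mine; the driver around it would memcpy up to the returned position, emit 00 00 03, and resume two bytes past it so the escaped byte is re-examined). The sublinear behaviour comes from the fact that any pair of consecutive zeros must cover one odd offset, so only every other byte needs to be tested:

#include <stdint.h>
#include <stddef.h>

/* Find the start of the next 00 00 0x triple, or NULL if none.
   Only odd offsets are tested for zero, since any pair of
   consecutive zeros must include one of them. */
static const uint8_t *find_escape( const uint8_t *src, const uint8_t *end )
{
    const uint8_t *p;
    for( p = src + 1; p + 1 < end; p += 2 )
    {
        if( *p )                                    /* no 00 00 pair includes p */
            continue;
        if( p[-1] == 0 && p[1] <= 3 )
            return p - 1;                           /* triple starts at p-1 */
        if( p[1] == 0 && p + 2 < end && p[2] <= 3 )
            return p;                               /* triple starts at p */
    }
    return NULL;
}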
Bevan