I have 2 arrays of 16 elements (chars) that I need to "compare" and see how many elements are equal between the two.

This routine is going to be used millions of times (a usual run is about 60 or 70 million times), so I need it to be as fast as possible. I'm working in C++ (C++Builder 2007, for the record).

Right now, I have a simple:

matches += array1[0] == array2[0];

repeated 16 times (profiling shows this to be about 30% faster than doing it with a for loop).
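For reference, the full routine is essentially the following (a reconstruction of what is described above; the function name and signature are just illustrative):

// Baseline: fully unrolled byte-by-byte comparison of two 16-element arrays.
inline int countMatches16(const char* array1, const char* array2)
{
    int matches = 0;
    matches += array1[0]  == array2[0];
    matches += array1[1]  == array2[1];
    matches += array1[2]  == array2[2];
    matches += array1[3]  == array2[3];
    matches += array1[4]  == array2[4];
    matches += array1[5]  == array2[5];
    matches += array1[6]  == array2[6];
    matches += array1[7]  == array2[7];
    matches += array1[8]  == array2[8];
    matches += array1[9]  == array2[9];
    matches += array1[10] == array2[10];
    matches += array1[11] == array2[11];
    matches += array1[12] == array2[12];
    matches += array1[13] == array2[13];
    matches += array1[14] == array2[14];
    matches += array1[15] == array2[15];
    return matches;
}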

Is there any other way that could work faster?

Some data about the environment and the data itself:

  • I'm using C++Builder, which doesn't have any speed optimizations worth taking into account. I will eventually try another compiler, but right now I'm stuck with this one.
  • The data will be different most of the time. 100% equal data is very rare (maybe less than 1% of cases).
A: 

Is it faster as one statement?

matches += (array1[0] == array2[0]) + (array1[1] == array2[1]) + ...;
Joe Skora
+2  A: 

If you have the ability to control the location of the arrays, for instance by putting one right after the other in memory, it might cause them to be loaded into the CPU's cache on the first access.

It depends on the CPU and its cache structure and will vary from one machine to another.

You can read about memory hierarchy and caches in Hennessy & Patterson's Computer Architecture: A Quantitative Approach.
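For example, one way to keep both arrays adjacent in memory (so a single cache-line fetch covers both) is to wrap them in one struct; the struct and function names below are just illustrative:

// Both 16-byte arrays live side by side, so on most modern x86 CPUs
// a single 64-byte cache line holds the entire pair.
struct ArrayPair
{
    char a[16];
    char b[16];
};

int countMatches(const ArrayPair& p)
{
    int matches = 0;
    for (int i = 0; i < 16; ++i)   // the counting itself can use whichever scheme is fastest
        matches += p.a[i] == p.b[i];
    return matches;
}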

Asaf R
A: 

If writing that 16 times is faster than a simple loop, then your compiler either sucks or you don't have optimization turned on.

Short answer: there's no faster way, unless you do vector operations on parallel hardware.

Ben Collins
Yeah, Borland C++ compilers suck at optimization. I might move this code to MSVC and try it there; I have some experience with the same code being much, much faster with the MS compiler than with the Borland/CodeGear one.
Rodrigo Gómez
+1  A: 

Magical compiler options will vary the time greatly. In particular, making the compiler generate SSE vectorization will likely get you a huge speedup.

Greg Rogers
+2  A: 

If you need the absolute lowest footprint, I'd go with assembly code. I haven't done this in a while, but I'll bet MMX (or more likely SSE2/3) has instructions that let you do exactly that in very few instructions.

Tomer Gabel
+2  A: 

If matches are the common case then try loading the values as 32-bit ints instead of individual chars so you can compare 4 bytes in one go (and count it as 4 matches).

If the two 32-bit values are not the same then you will have to test the bytes separately (mask out and compare each byte).

The code will be more complex, but should be faster.

If you are targeting a 64-bit system you could do the same trick with 64 bit ints, and if you really want to push the limit then look at dropping into assembler and using the various vector based instructions which would let you work with 128 bits at once.
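A sketch of the 64-bit variant adapted to this question's char arrays (assumes a compiler with a 64-bit integer type; memcpy is used for the loads so alignment doesn't matter):

#include <cstring>

// Compare two 16-byte arrays as two 64-bit words, falling back to
// byte-by-byte counting only when a word differs.
int countMatches16(const char* a, const char* b)
{
    int matches = 0;
    for (int i = 0; i < 16; i += 8)
    {
        unsigned long long wa, wb;
        std::memcpy(&wa, a + i, 8);   // safe even for unaligned data
        std::memcpy(&wb, b + i, 8);
        if (wa == wb)
        {
            matches += 8;             // all 8 bytes match
        }
        else
        {
            for (int j = 0; j < 8; ++j)
                matches += a[i + j] == b[i + j];
        }
    }
    return matches;
}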

Rob Walker
Thanks Rob. I've just tried something similar (the code posted by Andrew) and it doesn't speed things up. Usually matches won't be common.
Rodrigo Gómez
A: 

Try using pointers instead of arrays:

const char* p1 = &array1[0];
const char* p2 = &array2[0];
int match = 0;
match += (*p1++ == *p2++);
// ... repeat the previous line 15 more times (16 in total).

Of course you must measure this against other approaches to see which is fastest.

And are you sure that this routine is a bottleneck in your processing? Do you actually speed up the performance of your application as a whole by optimizing this? Again, only measurement will tell.

Justsalt
I'm sure this is the bottleneck. I've been profiling this using AQTime, and this function represents about 65% of the total runtime of the process. The other 25% is the function that calls this, and that's the one that "splits" the big arrays into arrays of 16 elements.
Rodrigo Gómez
Note: "pointers instead of arrays" is not always a good idea. A good optimizing compiler can work better on array+indices than on pointer access.I suggest coding both, measuring both and keeping the simplest one (IMHO the array). YMMV, of course.
rlerallut
+6  A: 

The key is to do the comparisons using the largest register your CPU supports, then fall back to bytes if necessary.

The code below demonstrates this using 4-byte integers, but if you are running on a SIMD architecture (any modern Intel or AMD chip) you could compare both arrays in one instruction before falling back to an integer-based loop. Most compilers these days have intrinsic support for 128-bit types, so this will NOT require ASM.

(Note that for the SIMD comparisons your arrays would have to be 16-byte aligned, and some processors (e.g. MIPS) would require the arrays to be 4-byte aligned for the int-based comparisons.)

E.g.

int* array1 = (int*)byteArray[0];
int* array2 = (int*)byteArray[1];

int same = 0;

for (int i = 0; i < 4; i++)
{
  // test as an int
  if (array1[i] == array2[i])
  {
    same += 4;
  }
  else
  {
    // test individual bytes
    char* bytes1 = (char*)(array1+i);
    char* bytes2 = (char*)(array2+i);

    for (int j = 0; j < 4; j++)
    {
      same += (bytes1[j] == bytes2[j]);
    }
  }
}

I can't remember exactly what the MSVC compiler supports for SIMD, but you could do something like:

// depending on compiler you may have to insert the words via an intrinsic
__m128 qw1 = *(__m128*)byteArray[0];
__m128 qw2 = *(__m128*)byteArray[1];

// again, depending on the compiler the comparison may have to be done via an intrinsic
if (qw1 == qw2)
{
    same = 16;
}
else
{
    // do int/byte testing
}
Andrew Grant
I've just tried this one, and it doesn't make things faster. For loops with BCB really suck, and, on the other hand, most of the int comparisons are false, so one still needs to check byte by byte. Thanks for the idea though. I'll try it again when moving this to an MSVC DLL.
Rodrigo Gómez
Rodrigo, you can obviously unroll the for loops.
Andrew Grant
A: 

Is there any way you can modify the way the arrays are stored? Comparing 1 byte at a time is extremely slow considering you are probably using a 32-bit compiler. Instead, if you stored your 16 bytes in 4 integers (32-bit) or 2 longs (64-bit), you would only need to perform 4 or 2 comparisons respectively.

The question to ask yourself is how much it costs to store the data as 4-integer or 2-long arrays, how often you need to access the data, and so on.

The issue here is that I don't just need to see whether the 16 bytes are equal or not, but how similar they are. Usually they won't be 100% equal, so comparing them as ints or longs usually won't help much (I just tried something similar and it didn't help). Thanks anyway.
Rodrigo Gómez
A: 

There's always the good old x86 REPNE CMPS instruction.

moonshadow
+6  A: 

UPDATE: This answer has been modified to make my comments match the source code provided below.

There is an optimization available if you have the capability to use SSE2 and popcnt instructions.

16 bytes happens to fit nicely in an SSE register. Using C++ and assembly/intrinsics, load the two 16-byte arrays into xmm registers and compare them. This generates a bitmask representing the true/false condition of the compare. You then use a movmsk instruction to load a bit representation of the bitmask into an x86 register; this becomes a bit field where you can count all the 1's to determine how many true values you had. A hardware popcnt instruction can be a fast way to count all the 1's in a register.

This requires knowledge of assembly/intrinsics and SSE in particular. You should be able to find web resources for both.

If you run this code on a machine that does not support either SSE2 or popcnt, you must then iterate through the arrays and count the differences with your unrolled loop approach.

Good luck

Edit: Since you indicated you did not know assembly, here's some sample code to illustrate my answer:

#include "stdafx.h"
#include <iostream>
#include "intrin.h"

inline unsigned cmpArray16( char (&arr1)[16], char (&arr2)[16] )
{
    __m128i first = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr1 ) );
    __m128i second = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr2 ) );

    return _mm_movemask_epi8( _mm_cmpeq_epi8( first, second ) );
}

int _tmain( int argc, _TCHAR* argv[] )
{
    unsigned count = 0;
    char arr1[16] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0 };
    char arr2[16] = { 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0 };

    count = __popcnt( cmpArray16( arr1, arr2 ) );

    std::cout << "The number of equivalent bytes = " << count << std::endl;

    return 0;
}

Some notes: this function uses SSE2 instructions and the popcnt instruction introduced with the Phenom processor (that's the machine that I use). I believe the most recent Intel processors with SSE4 also have popcnt. This function does not check for instruction support with CPUID; its behavior is undefined if it is run on a processor that does not have SSE2 or popcnt (you will probably get an invalid opcode exception). That detection code is a topic for a separate thread.

I have not timed this code; the reason I think it's faster is because it compares 16 bytes at a time, branchless. You should modify this to fit your environment, and time it yourself to see if it works for you. I wrote and tested this on VS2008 SP1.

SSE prefers data that is aligned on a natural 16-byte boundary; if you can guarantee that then you should get additional speed improvements, and you can change the _mm_loadu_si128 instructions to _mm_load_si128, which requires alignment.

Kent Knox
I actually need to know how many elements are equal between the two arrays. I'll study the idea anyway, and search for ASM code for this. I know nothing of ASM. Thanks :-)
Rodrigo Gómez
Thanks for the code. I don't think I'll manage to make it run on BCB, but I'll try it ASAP with a VS2008 DLL. I actually believe that even my current code will run faster when compiled with VS2008, but I'll profile both versions.
Rodrigo Gómez
Kent: I implemented your solution, except for the __popcnt use (I replaced it with a std::bitset), and now it takes half the time! I expected a speedup, but not that much! I'll do the CPUID check and try on a machine with support for that (my 1st-gen MacPro doesn't seem to support it). Thanks a lot!
Rodrigo Gómez
Your use of std::bitset to replace the hardware popcnt instruction is clever. You would expect the bitset::count function to be reasonably optimized, and it covers all the processors that don't provide the functionality in hardware. A hardware popcount provides additional benefits, if available (see the sketch after these comments).
Kent Knox
Yeah, I'll actually have to create an unoptimized version, doing things the way I did before, in case this has to run on non-SSE2 CPUs (I really hope it doesn't, but you never know), so I'll create 3 versions: unoptimized, SSE2, and SSE2+popcnt. Thanks again!
Rodrigo Gómez
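For reference, the std::bitset variant mentioned in these comments might look roughly like this (a sketch, not necessarily the exact code used):

#include <bitset>
#include <emmintrin.h>  // SSE2 intrinsics

// Count equal bytes with SSE2; std::bitset::count stands in for the
// hardware popcnt instruction on CPUs that lack it.
inline unsigned cmpArray16( char (&arr1)[16], char (&arr2)[16] )
{
    __m128i first  = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr1 ) );
    __m128i second = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr2 ) );
    unsigned mask  = _mm_movemask_epi8( _mm_cmpeq_epi8( first, second ) );
    return static_cast<unsigned>( std::bitset<16>( mask ).count() );
}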
+1  A: 

Does this have to be platform independent, or will this code always run on the same type of CPU? If you restrict yourself to modern x86 CPUs, you may be able to use MMX instructions, which should allow you to operate on an array of 8 bytes in one clock tick. AFAIK, gcc allows you to embed assembly in your C code, and Intel's compiler (icc) supports intrinsics, which are wrappers that allow you to call specific assembly instructions directly. Other SIMD instruction sets, such as SSE, may also be useful for this.
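A minimal sketch of the MMX idea (8 bytes per comparison; assumes 8-byte-aligned data and SSE-capable hardware for _mm_movemask_pi8; the function name is just illustrative):

#include <mmintrin.h>   // MMX
#include <xmmintrin.h>  // _mm_movemask_pi8 (SSE integer extension on MMX registers)

// Count how many of 8 bytes are equal using MMX.
inline int countEqual8( const char* a, const char* b )
{
    __m64 eq = _mm_cmpeq_pi8( *reinterpret_cast<const __m64*>( a ),
                              *reinterpret_cast<const __m64*>( b ) ); // 0xFF where bytes match
    int mask = _mm_movemask_pi8( eq );  // one bit per byte
    _mm_empty();                        // leave MMX state before any x87 floating point
    int count = 0;
    for ( ; mask; mask >>= 1 )          // count the set bits of the 8-bit mask
        count += mask & 1;
    return count;
}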

Dima
It doesn't have to be platform independent, at least not for now. I know that the C++Builder compiler I'm using does allow embedding asm instructions. The problem is that I don't know ASM :-) I'll have to start studying it a bit.
Rodrigo Gómez
+1  A: 

Is there any connection between the values in the arrays? Are some bytes more likely to be the same than others? Might there be some intrinsic order in the values? Then you could optimize for the most probable case.

Markus Schnell
Thanks Markus. Unfortunately, there are no likely values/positions or, in the end, probable cases. The only constant is the fixed length of the arrays, 16, which covers 95% or more of the cases. I still have a for loop for the other cases where the size is not 16.
Rodrigo Gómez
A: 

One extra possible optimization: if you expect that most of the time the arrays are identical, then it might be slightly faster to do a memcmp() as the first step, setting 16 as the answer if the test returns true. Of course, if you are not expecting the arrays to be identical very often, that would only slow things down.
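A sketch of that early-out (the function name is just illustrative):

#include <cstring>

// If the whole 16 bytes match, skip the per-byte counting.
int countMatches16( const char* a, const char* b )
{
    if ( std::memcmp( a, b, 16 ) == 0 )
        return 16;
    int matches = 0;
    for ( int i = 0; i < 16; ++i )
        matches += a[i] == b[i];
    return matches;
}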

DJClayworth
Thanks. Unfortunately most of the time the arrays will be different.
Rodrigo Gómez
+1  A: 

If you explain what the data actually represents, then there might be a totally different way to represent it in memory that would make this type of brute-force compare unnecessary. Care to elaborate on what the data actually represents?

KPexEA