views:

5416

answers:

14

Below is my current char* to hex string function. I wrote it as an exercise in bit manipulation. It takes ~7ms on a AMD Athlon MP 2800+ to hexify a 10 million byte array. Is there any trick or other way that I am missing?

How can I make this faster?

Compiled with -O3 in g++

static const char _hex2asciiU_value[256][2] =
     { {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };

std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    std::string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;

    clock_t stick, etick;
    stick = clock();
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
        pszHex[0] = _hex2asciiU_value[*pChar][0];
        pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    etick = clock();

    std::cout << "ticks to hexify " << etick - stick << std::endl;

    return str;
}

Updates

Added timing code

Brian R. Bondy: replace the std::string with a heap alloc'd buffer and change ofs*16 to ofs << 4 - however the heap allocated buffer seems to slow it down? - result ~11ms

Antti Sykäri:replace inner loop with

 int upper = *pChar >> 4;
 int lower = *pChar & 0x0f;
 pszHex[0] = pHex[upper];
 pszHex[1] = pHex[lower];

result ~8ms

Robert: replace _hex2asciiU_value with a full 256-entry table, sacrificing memory space but result ~7ms!

HoyHoy: Noted it was producing incorrect results

+2  A: 

For one, instead of *16 do a bitshift << 4

Also don't use the stl string, instead just create a buffer on the heap and then delete it. It will be more efficient than the object destruction that is needed from the string.

Brian R. Bondy
The compiler should do that for you. Bitshifting instead of multiplying makes the code far less human-readable.
yjerem
It's not like his code is void of other bit shifts...And there is no guarantee that his compiler will do this. And his specific question was about making it more efficient.
Brian R. Bondy
There's no guarantee that the multiply is faster than shift either.
nos
+3  A: 

Operate on 32 bits at a time (4 chars), then deal with the tail if needed. When I did this exercise with urlencoding a full table lookup for each char was slightly faster than logic constructs, so you may want to test this in context as well to take caching issues into account.

/Allan

Allan Wind
+1  A: 

not going to make a lot of difference... pChar-(ofs16) can be done with [*pCHar & 0x0F]

Keith Nicholas
+1  A: 

This is my version, which, unlike the OP's version, doesn't assume that std::basic_string has its data in contiguous region:

#include <string>

using std::string;

static char const* digits("0123456789ABCDEF");

string
tohex(string const& data)
{
    string result(data.size() * 2, 0);
    string::iterator ptr(result.begin());
    for (string::const_iterator cur(data.begin()), end(data.end()); cur != end; ++cur) {
        unsigned char c(*cur);
        *ptr++ = digits[c >> 4];
        *ptr++ = digits[c & 15];
    }
    return result;
}
Chris Jester-Young
A: 

Make sure your compiler optimization is turned on to the highest working level.

You know, flags like '-O1' to '-03' in gcc.

+4  A: 

At the cost of more memory you can create a full 256-entry table of the hex codes:

static const char _hex2asciiU_value[256][2] =
    { {'0','0'}, {'0','1'}, /* ..., */ {'F','E'},{'F','F'} };

Then direct index into the table, no bit fiddling required.

const char *pHexVal = pHex[*pChar];
pszHex[0] = pHexVal[0];
pszHex[1] = pHexVal[1];
Robert
+1  A: 
Antti Sykäri
A: 

I have found that using an index into an array, rather than a pointer, can speed things up a tick. It all depends on how your compiler chooses to optimize. The key is that the processor has instructions to do complex things like [i*2+1] in a single instruction.

Mark Ransom
seems dubious! An index into an array *is* just pointer arithmetic!
Mitch Wheat
A: 

The function as it is shown when I'm writing this produces incorrect output even when _hex2asciiU_value is fully specified. The following code works, and on my 2.33GHz Macbook Pro runs in about 1.9 seconds for 200,000,000 million characters.

#include <iostream>

using namespace std;

static const size_t _h2alen = 256;
static char _hex2asciiU_value[_h2alen][3];

string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;
    const char* pHex = _hex2asciiU_value[0];
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
       pszHex[0] = _hex2asciiU_value[*pChar][0];
       pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    return str;
}


int main() {
  for(int i=0; i<_h2alen; i++) {
    snprintf(_hex2asciiU_value[i], 3,"%02X", i);
  }
  size_t len = 200000000;
  char* a = new char[len];
  string t1;
  string t2;
  clock_t start;
  srand(time(NULL));
  for(int i=0; i<len; i++) a[i] = rand()&0xFF;
  start = clock();
  t1=char_to_hex((const unsigned char*)a, len);
  cout << "char_to_hex conversion took ---> " << (clock() - start)/(double)CLOCKS_PER_SEC << " seconds\n";
}
hoyhoy
>200,000,000 <s>million</s> characters
Josh
A: 

If you're rather obsessive about speed here, you can do the following:

Each character is one byte, representing two hex values. Thus, each character is really two four-bit values.

So, you can do the following:

  1. Unpack the four-bit values to 8-bit values using a multiplication or similar instruction.
  2. Use pshufb, the SSSE3 instruction (Core2-only though). It takes an array of 16 8-bit input values and shuffles them based on the 16 8-bit indices in a second vector. Since you have only 16 possible characters, this fits perfectly; the input array is a vector of 0 through F characters, and the index array is your unpacked array of 4-bit values.

Thus, in a single instruction, you will have performed 16 table lookups in fewer clocks than it normally takes to do just one (pshufb is 1 clock latency on Penryn).

So, in computational steps:

  1. A B C D E F G H I J K L M N O P (64-bit vector of input values, "Vector A") -> 0A 0B 0C 0D 0E 0F 0G 0H 0I 0J 0K 0L 0M 0N 0O 0P (128-bit vector of indices, "Vector B"). The easiest way is probably two 64-bit multiplies.
  2. pshub [0123456789ABCDEF], Vector B
Dark Shikari
A: 

I'm not sure doing it more bytes at a time will be better... you'll probably just get tons of cache misses and slow it down significantly.

What you might try is to unroll the loop though, take larger steps and do more characters each time through the loop, to remove some of the loop overhead.

slicedlime
More bytes at a time ought to work great, up to the system's word size
Josh
A: 

Consistently getting ~4ms on my Athlon 64 4200+ (~7ms with original code)

for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) {
    const char* pchars = _hex2asciiU_value[*pChar];
    *pszHex++ = *pchars++;
    *pszHex++ = *pchars;
}
+2  A: 
hoyhoy
+3  A: 

This assembly function (based off my previous post here, but I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a Core 2 Conroe 3Ghz. Penryn is probably faster.

%include "x86inc.asm"

SECTION_RODATA
pb_f0: times 16 db 0xf0
pb_0f: times 16 db 0x0f
pb_hex: db 48,49,50,51,52,53,54,55,56,57,65,66,67,68,69,70

SECTION .text

; int convert_string_to_hex( char *input, char *output, int len )

cglobal _convert_string_to_hex,3,3
    movdqa xmm6, [pb_f0 GLOBAL]
    movdqa xmm7, [pb_0f GLOBAL]
.loop:
    movdqa xmm5, [pb_hex GLOBAL]
    movdqa xmm4, [pb_hex GLOBAL]
    movq   xmm0, [r0+r2-8]
    movq   xmm2, [r0+r2-16]
    movq   xmm1, xmm0
    movq   xmm3, xmm2
    pand   xmm0, xmm6 ;high bits
    pand   xmm2, xmm6
    psrlq  xmm0, 4
    psrlq  xmm2, 4
    pand   xmm1, xmm7 ;low bits
    pand   xmm3, xmm7
    punpcklbw xmm0, xmm1
    punpcklbw xmm2, xmm3
    pshufb xmm4, xmm0
    pshufb xmm5, xmm2
    movdqa [r1+r2*2-16], xmm4
    movdqa [r1+r2*2-32], xmm5
    sub r2, 16
    jg .loop
    REP_RET

Note it uses x264 assembly syntax, which makes it more portable (to 32-bit vs 64-bit, etc). To convert this into the syntax of your choice is trivial: r0, r1, r2 are the three arguments to the functions in registers. Its a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include that to run it natively.

P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?

Dark Shikari