tags:
views: 942
answers: 6

Hello all :)

I am writing a piece of code designed to do some data compression on CLSID structures. I'm storing them as a compressed stream of 128-bit integers. However, the code in question also has to be able to place invalid CLSIDs into the stream. To handle those, I have left the invalid ones as one big string. On disk, it would look something like this:

+--------------------------+-----------------+------------------------+
|                          |                 |                        |
| Length of Invalid String | Invalid String  | Compressed Data Stream |
|                          |                 |                        |
+--------------------------+-----------------+------------------------+
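Roughly, the writer assembles a record like this (just a sketch, not my actual code: appendLengthBytes, compressedStream and writeBytes stand in for the real routines, and appendLengthBytes is the part this question is about):

std::vector<BYTE> record;
DWORD invalidLength = (DWORD) invalidClsids.length();
// 1. Length of the invalid string, output as 4 bytes (see below)
appendLengthBytes(record, invalidLength);
// 2. The invalid-CLSID string itself
record.insert(record.end(), invalidClsids.begin(), invalidClsids.end());
// 3. The compressed stream of 128-bit integers follows
record.insert(record.end(), compressedStream.begin(), compressedStream.end());
writeBytes(record);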

To encode the length of the string, I need to output the 32-bit integer holding its length one byte at a time. Here's my current code:

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.push_back((BYTE) ( invalidLength        & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE)  (invalidLength >>= 8));

This code won't be called often, but there will need to be a similar structure in the decoding stage that is called many thousands of times. I'm curious whether this is the most efficient method, or if someone can come up with a better one?
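For reference, the decode side (not written yet) would just be the mirror image; something like this, where input is whatever byte buffer the stream gets read into:

DWORD invalidLength =  (DWORD) input[0]
                    | ((DWORD) input[1] << 8)
                    | ((DWORD) input[2] << 16)
                    | ((DWORD) input[3] << 24);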

Thanks all!

Billy3

EDIT: After looking over some of the answers, I created this mini test program to see which was the fastest:

// temp.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <ctime>
#include <iostream>
#include <vector>

void testAssignedShifts();
void testRawShifts();
void testUnion();

int _tmain(int argc, _TCHAR* argv[])
{
    std::clock_t startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
     testAssignedShifts();
    }
    std::clock_t assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
     testRawShifts();
    }
    std::clock_t rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
     testUnion();
    }
    std::clock_t unionFinishedTime = std::clock();
    std::printf(
     "Execution time for assigned shifts: %08u clocks\n"
     "Execution time for raw shifts:      %08u clocks\n"
     "Execution time for union:           %08u clocks\n\n",
     assignedShiftsFinishedTime - startTime,
     rawShiftsFinishedTime - assignedShiftsFinishedTime,
     unionFinishedTime - rawShiftsFinishedTime);
    startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
     testAssignedShifts();
    }
    assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
     testRawShifts();
    }
    rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
     testUnion();
    }
    unionFinishedTime = std::clock();
    std::printf(
     "Execution time for assigned shifts: %08u clocks\n"
     "Execution time for raw shifts:      %08u clocks\n"
     "Execution time for union:           %08u clocks\n\n"
     "Finished. Terminate!\n\n",
     assignedShiftsFinishedTime - startTime,
     rawShiftsFinishedTime - assignedShiftsFinishedTime,
     unionFinishedTime - rawShiftsFinishedTime);

    system("pause");
    return 0;
}

void testAssignedShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE)  invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
}
void testRawShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE) invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>  8));
    compressedBytes.push_back((BYTE) (invalidLength >>  16));
    compressedBytes.push_back((BYTE) (invalidLength >>  24));
}

typedef union _choice
{
    DWORD dwordVal;
    BYTE bytes[4];
} choice;

void testUnion()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    choice invalidLength;
    invalidLength.dwordVal = (DWORD) invalidClsids.length();
    compressedBytes.push_back(invalidLength.bytes[0]);
    compressedBytes.push_back(invalidLength.bytes[1]);
    compressedBytes.push_back(invalidLength.bytes[2]);
    compressedBytes.push_back(invalidLength.bytes[3]);
}

Running this a few times results in:

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012578 clocks
Execution time for union:           00013172 clocks

Execution time for assigned shifts: 00012594 clocks
Execution time for raw shifts:      00013140 clocks
Execution time for union:           00012782 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012515 clocks
Execution time for union:           00012531 clocks

Execution time for assigned shifts: 00012391 clocks
Execution time for raw shifts:      00012469 clocks
Execution time for union:           00012500 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012562 clocks
Execution time for union:           00012422 clocks

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012407 clocks
Execution time for union:           00012468 clocks

Looks to be about a tie between assigned shifts and union. Since I'm going to need the value later, union it is! Thanks!

Billy3

+6  A: 

This is probably as optimized as you'll get. Bit-twiddling operations are some of the fastest available on the processor.

It may be faster to use plain >> 8, >> 16, >> 24 shifts instead of repeated >>= 8 - you cut out an assignment each time.
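Something along these lines (the same push_backs, just shifting without assigning back into invalidLength):

compressedBytes.push_back((BYTE)  invalidLength);
compressedBytes.push_back((BYTE) (invalidLength >> 8));
compressedBytes.push_back((BYTE) (invalidLength >> 16));
compressedBytes.push_back((BYTE) (invalidLength >> 24));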

Also I don't think you need the & - since you're casting to a BYTE (which should be an 8-bit char) it'll get truncated down appropriately anyway. (Is it? Correct me if I'm wrong.)

All in all, though, these are really minor changes. Profile it to see if it actually makes a difference :P

v3
SoapBox
This deserves a +1 as it's the only endian-independent approach, which is in general the best way to do things.
j_random_hacker
A: 

Do you have to do it one byte at a time? Is there a way you could just memcpy() the whole 32 bits into the stream in one fell swoop? If you have the address of the buffer you're writing to the stream, can you just copy into that?
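Something like this, roughly (an untested sketch, and endian-dependent since it copies the bytes in whatever order the machine stores them):

std::vector<BYTE> compressedBytes(sizeof(DWORD));
DWORD invalidLength = (DWORD) invalidClsids.length();
memcpy(&compressedBytes[0], &invalidLength, sizeof(DWORD));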

Furious Coder
Unfortunately yes. I put up the vector as sample code for demonstration ... but what I'm putting it through wants byte-by-byte representation.
Billy ONeal
+5  A: 

Just use a union:

assert(sizeof (DWORD) == sizeof (BYTE[4]));   // Sanity check

union either {
    DWORD dw;
    struct {
         BYTE b[4];
    } bytes;
};

either invalidLength;
invalidLength.dw = (DWORD) invalidClsids.length();
compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

NOTE: Unlike the bit-shifting approach in the original question, this code produces endian-dependent output. This matters only if output from a program running on one computer will be read on a computer with different endianness -- but as there seems to be no measurable speed increase from using this method, you might as well use the more portable bit-shifting approach, just in case.

j_random_hacker
The impression I've gotten is that it's wrong to use unions as casts. Unions are (rarely) best used as ways to save on memory by not storing two large items in memory when you only ever use one at a time.
Ed
Also, be sure you understand the endianness issues if this code might ever be ported...
Michael Burr
Endianness doesn't really matter because even if it's written backwards, it will be read backwards as well.
Billy ONeal
@Ed: Actually you have it round the wrong way: the C++ standard specifically allows the "reinterpretation" of the contents of a union (3.10.15) -- what *does* lead to undefined behaviour is casting a pointer to X to a pointer to Y (where X and Y are unrelated types) and then dereferencing it.
j_random_hacker
@Ed: Casting an X* to a Y* where X and Y are unrelated types breaks C++'s aliasing assumptions, namely that a Y object can only be pointed to by pointers of static type Y* (or pointers to a base class of Y). This assumption is important for generating optimised code.
j_random_hacker
@Michael: Very good point about endianness. The code is non-portable in that respect -- to produce the same result as the OP's code, it needs to be run on a little-endian machine.
j_random_hacker
@BillyONeal - but if it gets written on one system and read on another with different byte ordering it'll be broken. I'd at least throw in a compile-time assert to test endianness so if someone decides a port is a good idea a couple years from now, the compiler will tell them about the problem.
Michael Burr
@BillyONeal: Sort-of. The endianness-dependence *matters* only if the output is read on a machine with different endianness, which may be unlikely in this case (CLSIDs somewhat implies Windows, which runs only on little-endian CPUs these days).
j_random_hacker
@j_random_hacker: This code is for Windows boxes... little-endian only anyway. I know one wants to strive for portable code, but the program in question is a Windows utility which will make little sense outside of Windows because it's reporting on Windows' internals.... Thanks again!
Billy ONeal
This isn't necessarily as fast as you might expect. Unions mean the compiler has to take aliasing into account, which may prevent a number of optimizations. Like Pax said, measure, instead of guessing.
jalf
@jalf: I totally agree with "measure everything." But in this case I think that vector reallocation time is likely to dwarf any differences -- the focus should be on figuring out the number of BYTES and reserve()ing that ahead of time.
j_random_hacker
@j_random_hacker For clarification, "wrong" was in terms of best practices and not in terms of the standard.
Ed
@Ed: In your opinion, what is the best-practice way of doing this, and why is it considered better than using unions? (I'm a bit disinclined to choose a non-standard approach over a standard-endorsed approach, but if you could tell me the reason why, I might change my mind...)
j_random_hacker
+1  A: 

A real quick way is to just treat a DWORD* (single-element array) as a BYTE* (4-element array). The code is also a lot more readable.

Warning: I haven't compiled this

Warning: This makes your code dependent on byte ordering

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
BYTE* lengthParts = (BYTE*) &invalidLength;
static const int kLengthPartsLength = sizeof(DWORD) / sizeof(BYTE);
for(int i = 0; i < kLengthPartsLength; ++i)
    compressedBytes.push_back(lengthParts[i]);
Ed
Note: this has the same result as the "union" answer given by j_random_hacker.
SoapBox
+1. Provided that BYTE is some sort of (possibly signed or unsigned) char type, the C++ standard guarantees undefined behaviour won't occur here -- though just as with unions, it is still *implementation-defined* behaviour (which, as you point out, effectively means byte-order dependence here).
j_random_hacker
@SoapBox: the resulting assembly might be the same, but as I mentioned on j_random_hacker's solution, I have a bias against using unions as casts.
Ed
+2  A: 

You should measure rather than guess at any potential improvement but my first thought is that it may be faster to do a union as follows:

typedef union {
    DWORD d;
    struct {
        BYTE b0;
        BYTE b1;
        BYTE b2;
        BYTE b3;
    } b;
} DWB;

std::vector<BYTE> compBytes;
DWB invLen;
invLen.d = (DWORD) invalidClsids.length();
compBytes.push_back(invLen.b.b3);
compBytes.push_back(invLen.b.b2);
compBytes.push_back(invLen.b.b1);
compBytes.push_back(invLen.b.b0);

That may be the right order for the push_backs but check just in case - it depends on the endianness of the CPU.
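If the output ever has to match across machines, the push order could be picked at compile time; a rough sketch, assuming some MY_BIG_ENDIAN macro defined by your build (it's not a standard macro):

#ifdef MY_BIG_ENDIAN
// MSB sits in b0, so push in reverse to emit the least significant byte first.
compBytes.push_back(invLen.b.b3);
compBytes.push_back(invLen.b.b2);
compBytes.push_back(invLen.b.b1);
compBytes.push_back(invLen.b.b0);
#else
// Little-endian: b0 already holds the least significant byte.
compBytes.push_back(invLen.b.b0);
compBytes.push_back(invLen.b.b1);
compBytes.push_back(invLen.b.b2);
compBytes.push_back(invLen.b.b3);
#endif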

paxdiablo
This code has the potential to be better than the other union answer given, because you can reverse the order of b0-b3 using ifdefs and this code will then produce the same output on little- and big-endian machines.
SoapBox
+1. Totally agree with "measure everything." In this case, my guess is that vector reallocations will utterly dwarf any difference between bit-shifts and unions -- so if you could figure out how many bytes compBytes will hold ahead of time and reserve() that much, that's where you might win big.
j_random_hacker
@Soapbox: Cool idea! OTOH, formally speaking, a standard-conforming compiler may introduce arbitrary amounts of padding in between each BYTE in DWB.b, while it is not permitted to do so for an array of 4 BYTEs. But that's a minor quibble; in practice, every compiler has a mechanism to pack structs.
j_random_hacker
+1  A: 
compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

There is an even smarter and faster way! Let's see what this code is doing and how we can improve it.

This code is serializing the integer, one byte at a time. For each byte it's calling push_back, which is checking the free space in the internal vector buffer. If we have no room for another byte, memory reallocation will happen (hint, slow!). Granted, the reallocation will not happen frequently (reallocations typically happen by doubling the existing buffer). Then, the new byte is copied and the internal size is increased by one.

vector<> has a requirement by the standard which dictates that the internal buffer be contiguous. vector<> also happens to have an operator[](), so &compressedBytes[0] gives you a pointer straight into that buffer.

So, here is the best code you can come up with:

std::string invalidClsids("This is a test string");
std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.resize(sizeof(DWORD)); // You probably want to make this much larger, to avoid resizing later.
// compressedBytes is as large as the length we want to serialize.
BYTE* p = &compressedBytes[0]; // This is valid code and designed by the standard for such cases. p points to a buffer that is at least as large as a DWORD.
*((DWORD*)p) = invalidLength;  // Copy all bytes in one go!

The above cast can be folded into the &compressedBytes[0] expression and done in one go, but it won't be any faster. This way is more readable.
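That is, the folded-together version would be the single statement below; same behaviour, just harder on the eyes:

*((DWORD*)&compressedBytes[0]) = invalidLength;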

NOTE! Serializing this way (or even with the UNION method) is endian-dependent. That is, on an Intel/AMD processor the least significant byte will come first, while on a big-endian machine (PowerPC, Motorola...) the most significant byte will come first. If you want to be neutral, you must use a math method (shifts).

Ash
+1. Yes, believe it or not this isn't any more "non-standard" than loading the bytes from a union one at a time, and almost certainly faster (but again: measure everything).
j_random_hacker