views: 162

answers: 4

I have an algorithm that uses the following OpenSSL calls:

HMAC_Update() / HMAC_Final()             // RIPEMD-160
EVP_CipherUpdate() / EVP_CipherFinal()   // Blowfish-CBC

These calls take an unsigned char * to the "plain text". My input data comes from a C++ std::string::c_str(), which originates from a protocol buffer object as a UTF-8 encoded string. UTF-8 strings are meant to be endian neutral. However, I'm a bit paranoid about how OpenSSL may perform operations on the data.

My understanding is that encryption algorithms treat their input as a sequence of 8-bit bytes, and as long as an unsigned char * is used for pointer arithmetic when the operations are performed, the algorithms should be endian neutral and I should not need to worry about anything. My uncertainty is compounded by the fact that I am working on a little-endian machine and have never done any real cross-architecture programming.
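
For reference, here is roughly what the relevant code looks like (simplified, illustrative names only, no error handling, and written against the pre-1.1 OpenSSL API with a stack-allocated HMAC_CTX):

#include <string>
#include <openssl/evp.h>
#include <openssl/hmac.h>

// Simplified sketch: the MAC is computed over the raw bytes of the UTF-8
// std::string; OpenSSL only ever sees a sequence of unsigned chars.
void mac_of(const std::string &key, const std::string &msg,
            unsigned char mac[EVP_MAX_MD_SIZE], unsigned int *mac_len)
{
    HMAC_CTX ctx;
    HMAC_CTX_init(&ctx);
    HMAC_Init_ex(&ctx, key.data(), static_cast<int>(key.size()),
                 EVP_ripemd160(), NULL);
    HMAC_Update(&ctx,
                reinterpret_cast<const unsigned char *>(msg.data()),
                msg.size());
    HMAC_Final(&ctx, mac, mac_len);
    HMAC_CTX_cleanup(&ctx);
}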

My reasoning is based on the following two properties:

  1. std::string (not wstring) internally stores 8-bit chars, and the pointer returned by c_str() will iterate over them the same way regardless of the CPU architecture.
  2. Encryption algorithms are, either by design or by implementation, endian neutral.

I know the best way to get a definitive answer is to use QEMU and do some cross-platform unit tests (which I plan to do). My question is a request for comments on my reasoning, which may also assist other programmers faced with similar problems.

+2  A: 

It seems the real question here is:

"Can I be sure my encoded UTF-8 string will be represented internaly in the same way on different computers ?"

Because, as you stated, OpenSSL routines don't really deal with this (nor do they need to know about it).

Since you only asked for comments: I think you should be fine. OpenSSL routines should behave the same way for two identical chunks of data, whatever the computer architecture.

ereOn
A: 

One way to be sure of endianness is to follow the IP standard of network byte order.

Take a look here for the functions you need. These should be available on Windows and *nix with modern C++ implementations.
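
For instance, assuming the functions in question are the htonl()/ntohl() family (declared in <arpa/inet.h> on *nix and <winsock2.h> on Windows), a round trip through network byte order looks like this:

#include <stdint.h>
#include <arpa/inet.h>   // htonl()/ntohl(); on Windows use <winsock2.h>

// Illustrative only: convert a 32-bit value to network byte order
// (big-endian) before sending, and back to host order after receiving.
uint32_t host_value = 0x12345678;
uint32_t wire_value = htonl(host_value);   // what goes on the wire
uint32_t round_trip = ntohl(wire_value);   // == host_value on any host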

However, I believe your reasoning is correct, and you should not have to worry about it in this case.

Edit: To be clear, the network byte order comment assumes you are then sending the data and are worried about how it will be received on the other end. If the send and receive both happen on the same machine, there should be no problem.

Adam W
+6  A: 

A UTF-8 string and a std::string are both defined as a sequence of chars. Crypto algorithms are defined to operate on a sequence of bytes/octets (in C, bytes are the same as chars; if your byte isn't an octet then you're on an unusual implementation, and you might have to be a bit careful dealing with UTF-8). The only sensible way to represent a sequence of bytes in contiguous memory is with the first one at the lowest address and subsequent ones at higher addresses (a C array). Crypto algorithms don't care what the bytes represent, so you're fine.

Endian-ness only matters when you're dealing with something like an int, which isn't inherently a sequence of bytes. In the abstract, it's just "something" which holds values INT_MIN to INT_MAX. When you come to represent such a beast in memory, of course it has to be as a number of bytes, but there's no single way to do it.

In practice, endian-ness is important in C if you (perhaps via something you call) reinterpret a char* as an int*, or vice-versa, or define a protocol in which an int is represented using a sequence of chars. If you're only dealing with arrays of chars, or only dealing with arrays of ints, it's irrelevant, because endianness is a property of ints and other types bigger than char.
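
For example (an illustration only), the difference shows up as soon as you view an int's storage as chars:

#include <stdint.h>
#include <cstdio>
#include <cstring>

// Illustrative only: the same uint32_t value is laid out differently in
// memory on little-endian and big-endian hosts, which is why mixing char*
// and int* views of the same data is where endianness matters.
int main()
{
    uint32_t value = 0x01020304;
    unsigned char bytes[4];
    std::memcpy(bytes, &value, sizeof value);
    // Little-endian host: 04 03 02 01    Big-endian host: 01 02 03 04
    std::printf("%02x %02x %02x %02x\n",
                bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}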

Steve Jessop
+3  A: 

Some cryptographic algorithms, in particular hash functions (which are used in HMAC), are specified to operate on arbitrary sequences of bits. However, on actual, physical computers, and with most protocols, data is a sequence of octets: the number of bits is a multiple of eight, and bits can be processed in groups of eight. A group of eight bits is nominally an "octet", but the term "byte" is more often encountered. An octet has a numerical value between 0 and 255, inclusive. In some programming languages (e.g. Java), the numerical value is signed (between -128 and +127), but that is the same concept.

Note that in the context of the C programming language (as defined in the ISO 9899:1999 standard, aka "the C standard"), a byte is defined to be the elementary addressable memory unit, incarnated by the unsigned char type. sizeof returns a size in bytes (thus, sizeof(unsigned char) is necessarily equal to 1). malloc() takes a size in bytes. In C, the number of bits in a byte is specified by the CHAR_BIT macro (defined in <limits.h>) and is greater than or equal to eight. On most computers, there are exactly eight bits in a C byte (i.e. a C byte is an octet, and everybody calls it a "byte"). There are some systems with larger bytes (often embedded DSP) but if you had such a system you would know it.
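
If your code relies on that, a one-line compile-time check makes the assumption explicit (purely illustrative; CHAR_BIT is 8 on every platform you are likely to encounter):

#include <limits.h>

// Refuse to build on the exotic platforms where a byte is not an octet.
#if CHAR_BIT != 8
#error "This code assumes 8-bit bytes (octets)."
#endif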

So every cryptographic algorithm which works on arbitrary sequences of bits actually defines how the bits are internally interpreted into octets (bytes). The AES and SHA specifications go to great lengths to do that properly, even in the eyes of picky mathematicians. For every practical situation, your data is already a sequence of bytes, and the grouping of bits into bytes is assumed to have already taken place; so you just feed the bytes to the algorithm implementation and everything is fine.

Hence, in practical terms, cryptographic algorithm implementations expect a sequence of bytes as input, and produce sequences of bytes as output.
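
In OpenSSL terms (a sketch only, written against the pre-1.1 API, with key/IV management and error checks omitted and all names illustrative), the cipher calls from the question consume a byte buffer and emit a byte buffer:

#include <openssl/evp.h>

// Illustrative sketch: CBC-Blowfish over a byte buffer. The cipher neither
// knows nor cares what the bytes encode. "out" must have room for in_len
// plus one block of padding.
int encrypt_buf(const unsigned char *key, const unsigned char *iv,
                const unsigned char *in, int in_len, unsigned char *out)
{
    EVP_CIPHER_CTX ctx;
    EVP_CIPHER_CTX_init(&ctx);
    EVP_CipherInit_ex(&ctx, EVP_bf_cbc(), NULL, key, iv, 1 /* encrypt */);

    int len1 = 0, len2 = 0;
    EVP_CipherUpdate(&ctx, out, &len1, in, in_len);
    EVP_CipherFinal_ex(&ctx, out + len1, &len2);
    EVP_CIPHER_CTX_cleanup(&ctx);
    return len1 + len2;   // total number of ciphertext bytes written
}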

Endianness (implicitly at the byte level) is the convention on how multi-byte values (values which need several bytes to be encoded) are laid out into sequences of bytes (i.e. which byte goes first). UTF-8 is endian-neutral in that it already defines that layout: when a character is to be encoded into several bytes, UTF-8 mandates which of those bytes goes first and which goes last. This is why UTF-8 is "endian neutral": the transformation of characters into bytes is a fixed convention which does not depend upon how the local hardware likes best to read or write bytes. Endianness is most often related to how integer values are written in memory.
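
As a small illustration, the character U+00E9 ("é") always encodes to the bytes 0xC3 0xA9, in that order, on every host:

#include <cassert>
#include <string>

int main()
{
    // UTF-8 fixes the byte order itself: U+00E9 is 0xC3 followed by 0xA9
    // everywhere, so no byte swapping is ever involved.
    std::string e_acute = "\xC3\xA9";
    assert(static_cast<unsigned char>(e_acute[0]) == 0xC3);
    assert(static_cast<unsigned char>(e_acute[1]) == 0xA9);
    return 0;
}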

About cross-platform programming: there is no substitute for experience, so trying your code on several platforms is a good way to learn. You will already learn much by making your code 64-bit clean, i.e. having the same code run properly on both 32-bit and 64-bit platforms. Any recent PC with Linux will fit the bill. Big-endian systems are now quite rare; you would need an older Mac (one with a PowerPC processor), or one of a few kinds of Unix workstations (Sparc systems, or Itanium systems under HP-UX, come to mind). Newer designs tend to adopt the little-endian convention.

About endianness in C: If your program must worry about endianness then chances are that you are doing it wrong. Endianness is about conversions of integers (16-bit, 32-bit or more) into bytes, and back. If your code worries about endianness then this means that your code writes data as integers and reads it as bytes, or vice-versa. Either way, you are doing some "type aliasing": some parts of memory are accessed via several pointers of distinct types. This is bad. Not only does it make your code less portable, but it also tends to break when asking the compiler to optimize code.

In a proper C program, endianness is handled only for I/O, when values are to be written to or read from a file or a network socket. That I/O follows a protocol which defines the endianness to use (e.g. in TCP/IP, big-endian convention is often used). The "right" way is to write a few wrapper functions:

#include <stdint.h>

/* Read a 32-bit unsigned integer stored in little-endian order. */
uint32_t decode32le(const void *src)
{
    const unsigned char *buf = src;
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
        | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/* Read a 32-bit unsigned integer stored in big-endian order. */
uint32_t decode32be(const void *src)
{
    const unsigned char *buf = src;
    return (uint32_t)buf[3] | ((uint32_t)buf[2] << 8)
        | ((uint32_t)buf[1] << 16) | ((uint32_t)buf[0] << 24);
}

/* Write a 32-bit unsigned integer in little-endian order. */
void encode32le(void *dst, uint32_t val)
{
    unsigned char *buf = dst;
    buf[0] = val;
    buf[1] = val >> 8;
    buf[2] = val >> 16;
    buf[3] = val >> 24;
}

/* Write a 32-bit unsigned integer in big-endian order. */
void encode32be(void *dst, uint32_t val)
{
    unsigned char *buf = dst;
    buf[3] = val;
    buf[2] = val >> 8;
    buf[1] = val >> 16;
    buf[0] = val >> 24;
}

Possibly, make those functions "static inline" and put them in a header file, so that the compiler may inline them at will in calling code.

Then you use those functions whenever you want to write or read 32-bit integers from a memory buffer freshly obtained from (or soon to be written to) a file or socket. This will make your code endian-neutral (hence portable), and clearer, thus easier to read, develop, debug and maintain. And in the extremely rare situation where such encoding and decoding becomes a bottleneck (this may happen only if you use a platform with a very weak CPU and a very fast network connection, i.e. not a PC at all), you could still replace the implementation of those functions by some architecture specific macros, possibly with inline assembly, without modifying the rest of your code.
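
For example (field names made up for illustration), reading two big-endian 32-bit fields from a buffer freshly received from a socket, using the decode32be() above:

#include <stdint.h>

// Illustrative usage: the protocol stores both header fields as
// big-endian 32-bit integers at the start of the buffer.
void parse_header(const unsigned char *buf,
                  uint32_t *record_length, uint32_t *record_type)
{
    *record_length = decode32be(buf);      // bytes 0..3
    *record_type   = decode32be(buf + 4);  // bytes 4..7
}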

Thomas Pornin
A very thorough and effective treatment of the matter; your answer will enlighten and bolster the knowledge of many C/C++ programmers in this area of portability concerns.
Hassan Syed