Some cryptographic algorithms, in particular hash functions (which are used in HMAC), are specified to operate on arbitrary sequences of bits. However, on actual, physical computers, and with most protocols, data is a sequence of octets: the number of bits is a multiple of eight, and bits can be processed in groups of eight. A group of eight bits is nominally an "octet", but the term "byte" is more often encountered. An octet has a numerical value between 0 and 255, inclusive. In some programming languages (e.g. Java), the numerical value is signed (between -128 and +127), but that is the same concept.
Note that in the context of the C programming language (as defined in the ISO 9899:1999 standard, aka "the C standard"), a byte is defined to be the elementary addressable memory unit, incarnated by the unsigned char type. sizeof returns a size in bytes (thus, sizeof(unsigned char) is necessarily equal to 1), and malloc() takes a size in bytes. In C, the number of bits in a byte is given by the CHAR_BIT macro (defined in <limits.h>) and is greater than or equal to eight. On most computers there are exactly eight bits in a C byte (i.e. a C byte is an octet, and everybody calls it a "byte"). There are some systems with larger bytes (often embedded DSP), but if you had such a system you would know it.
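As a quick sanity check, a minimal sketch like the following prints those properties for your platform (typically 1, 8 and 255 on a PC):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* sizeof(unsigned char) is 1 by definition; CHAR_BIT is at least 8. */
    printf("sizeof(unsigned char) = %zu\n", sizeof(unsigned char));
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("UCHAR_MAX = %u\n", (unsigned)UCHAR_MAX);
    return 0;
}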
So every cryptographic algorithm which works on arbitrary sequences of bits actually defines how those bits are internally grouped into octets (bytes). The AES and SHA specifications go to great lengths to do that properly, even in the eyes of picky mathematicians. In every practical situation, your data is already a sequence of bytes, and the grouping of bits into bytes is assumed to have already taken place; so you just feed the bytes to the algorithm implementation and everything is fine.
Hence, in practical terms, cryptographic algorithm implementations expect a sequence of bytes as input, and produce sequences of bytes as output.
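As an illustration (a sketch only, assuming OpenSSL's libcrypto and its one-shot SHA256() function, which recent OpenSSL versions deprecate in favour of the EVP interface), hashing is just a matter of handing a byte buffer to the implementation and getting a fixed number of bytes back:

#include <stdio.h>
#include <openssl/sha.h>   /* assumption: OpenSSL's libcrypto is available */

int main(void)
{
    const unsigned char msg[] = "hello";          /* 5 bytes of input (NUL excluded) */
    unsigned char digest[SHA256_DIGEST_LENGTH];   /* 32 bytes of output */

    SHA256(msg, sizeof msg - 1, digest);          /* bytes in, bytes out */
    for (size_t i = 0; i < sizeof digest; i++) {
        printf("%02x", digest[i]);
    }
    printf("\n");
    return 0;
}

Compile with -lcrypto; the 32 output bytes are the SHA-256 digest, with no bit-level handling involved anywhere.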
Endianness (implicitly at the byte level) is the convention on how multi-byte values (values which need several bytes to be encoded) are laid out into sequences of bytes (i.e. which byte goes first). UTF-8 is endian-neutral in that it already defines that layout: when a character is to be encoded into several bytes, UTF-8 mandates which of those bytes goes first and which goes last. This is why UTF-8 is "endian neutral": the transformation of characters into bytes is a fixed convention which does not depend upon how the local hardware likes best to read or write bytes. Endianness is most often related to how integer values are written in memory.
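For instance, the 32-bit value 0x11223344 occupies the bytes 11 22 33 44 on a big-endian machine and 44 33 22 11 on a little-endian one. A minimal sketch (inspecting the object representation through memcpy(), which is always well-defined) shows what your local hardware does:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0x11223344;
    unsigned char bytes[4];

    memcpy(bytes, &x, 4);   /* copy the in-memory representation */
    printf("0x11223344 is stored as: %02x %02x %02x %02x\n",
        bytes[0], bytes[1], bytes[2], bytes[3]);
    /* prints "44 33 22 11" on a little-endian PC, "11 22 33 44" on big-endian */
    return 0;
}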
About cross-platform programming: There is no substitute for experience, so trying your code on several platforms is a good way to learn. You will already learn much by making your code 64-bit clean, i.e. having the same code run properly on both 32-bit and 64-bit platforms; any recent PC with Linux will fit the bill. Big-endian systems are now quite rare; you would need an older Mac (one with a PowerPC processor), or one of a few kinds of Unix workstations (Sparc systems, or Itanium systems under HP/UX, come to mind). Newer designs tend to adopt the little-endian convention.
About endianness in C: If your program must worry about endianness then chances are that you are doing it wrong. Endianness is about conversions of integers (16-bit, 32-bit or more) into bytes, and back. If your code worries about endianness then this means that your code writes data as integers and reads it as bytes, or vice-versa. Either way, you are doing some "type aliasing": some parts of memory are accessed via several pointers of distinct types. This is bad. Not only does it make your code less portable, but it also tends to break when asking the compiler to optimize code.
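To make the problem concrete, here is the kind of type aliasing to avoid (a sketch of the anti-pattern, shown only for illustration):

#include <stdint.h>

/* Anti-pattern: reinterpreting a byte buffer as a uint32_t through a cast.
   The result depends on the host's endianness, and the cast can violate
   alignment and strict-aliasing rules, which optimizing compilers exploit. */
uint32_t fragile_read32(const unsigned char *buf)
{
    return *(const uint32_t *)buf;   /* non-portable; do not do this */
}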
In a proper C program, endianness is handled only for I/O, when values are to be written to or read from a file or a network socket. That I/O follows a protocol which defines the endianness to use (e.g. in TCP/IP, big-endian convention is often used). The "right" way is to write a few wrapper functions:
#include <stdint.h>   /* for uint32_t */

/* Decode 4 bytes in little-endian order into a 32-bit unsigned integer. */
uint32_t decode32le(const void *src)
{
    const unsigned char *buf = src;

    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
        | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/* Decode 4 bytes in big-endian order into a 32-bit unsigned integer. */
uint32_t decode32be(const void *src)
{
    const unsigned char *buf = src;

    return (uint32_t)buf[3] | ((uint32_t)buf[2] << 8)
        | ((uint32_t)buf[1] << 16) | ((uint32_t)buf[0] << 24);
}

/* Encode a 32-bit unsigned integer into 4 bytes in little-endian order. */
void encode32le(void *dst, uint32_t val)
{
    unsigned char *buf = dst;

    buf[0] = val;
    buf[1] = val >> 8;
    buf[2] = val >> 16;
    buf[3] = val >> 24;
}

/* Encode a 32-bit unsigned integer into 4 bytes in big-endian order. */
void encode32be(void *dst, uint32_t val)
{
    unsigned char *buf = dst;

    buf[3] = val;
    buf[2] = val >> 8;
    buf[1] = val >> 16;
    buf[0] = val >> 24;
}
Possibly, make those functions "static inline" and put them in a header file, so that the compiler may inline them at will in calling code.
Then you use those functions whenever you want to read or write 32-bit integers in a memory buffer freshly obtained from (or soon to be written to) a file or socket. This will make your code endian-neutral (hence portable) and clearer, thus easier to read, develop, debug and maintain. And in the extremely rare situation where such encoding and decoding becomes a bottleneck (this may happen only if you use a platform with a very weak CPU and a very fast network connection, i.e. not a PC at all), you could still replace the implementation of those functions with architecture-specific macros, possibly with inline assembly, without modifying the rest of your code.
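As a usage sketch (the packet layout here is hypothetical), parsing and producing a 4-byte big-endian length field in a protocol header would then look like this:

#include <stdint.h>

/* Hypothetical header: a 4-byte big-endian length field at offset 0. */
uint32_t parse_length(const unsigned char *header)
{
    return decode32be(header);      /* endianness handled in exactly one place */
}

void write_length(unsigned char *header, uint32_t len)
{
    encode32be(header, len);        /* same convention when producing the header */
}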