If I use mmap to write uint32_t's, will I run into issues with big endian/little endian conventions? In particular, if I write some data mmap'ed on a big-endian machine, will I run into issues when I try to read that data on a little-endian machine?

+2  A: 

Yes.

mmap maps raw file data to process address space. It does not know anything about what the raw data represents, let alone try to convert it for you. If you are mapping the same file on architectures with different endianness, you'll have to do any necessary conversion yourself.
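
To make that concrete, here is a minimal sketch, assuming a POSIX system. The helper names `write_u32_via_mmap` and `read_raw_bytes` are made up for this illustration. It stores one uint32_t through a shared mapping and then reads the file back byte by byte, so you can see that the file holds whatever byte order the writing host uses.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: store one uint32_t at offset 0 of `path` through a
 * shared mapping. Returns 0 on success, -1 on failure. */
static int write_u32_via_mmap(const char *path, uint32_t value)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, sizeof value) != 0) {  /* file must be large enough to map */
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, sizeof value, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, &value, sizeof value);         /* bytes land in host order, unconverted */
    int rc = msync(p, sizeof value, MS_SYNC);
    munmap(p, sizeof value);
    return rc;
}

/* Hypothetical helper: read the first four raw bytes of `path` with read(). */
static int read_raw_bytes(const char *path, unsigned char out[4])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, out, 4);
    close(fd);
    return n == 4 ? 0 : -1;
}
```

Writing 0x0A0B0C0D this way leaves the bytes 0D 0C 0B 0A in the file on a little-endian host and 0A 0B 0C 0D on a big-endian one. Nothing in mmap changes that.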

As a portable data format across computers, I'd consider something of higher abstraction level such as JSON or even XML that does not tie the data format to a particular implementation. But it really depends on your specific requirements.

laalto
+4  A: 

If you're using mmap, you're probably concerned about speed and efficiency. You basically have a few choices.

  1. Wrap all your reads and writes with the htonl, htons, ntohl, and ntohs functions. Calling htonl (host to network order) on a little-endian host, such as x86 Windows or Linux, converts the data from little endian to big endian. On big-endian architectures it is a no-op. These conversions do have an overhead, but depending on your operations, it may or may not be significant. AFAIK, this is the approach used by SQLite.
  2. Your other option is to always write data in host format, and provide routines if users need to migrate their data across platforms. Databases usually read and write data in host format, but provide tools like bcp which will write to either ASCII or network byte order.
  3. You can tag the header of your file with a byte order mark. When your program starts, it compares its own byte order with that of the file and translates if needed. This is often good for simple data formats like UTF-16, but not for formats where you have a number of variable-length types.
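
Options 1 and 3 might look roughly like the sketch below. It assumes a POSIX system for `htonl`/`ntohl` in `<arpa/inet.h>`. The helper names and the tag value `BYTE_ORDER_TAG` are invented for this example and are not taken from SQLite or any other project.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Option 1: the mapping always holds big-endian (network order) values. */
static void store_u32_be(void *dst, uint32_t host_value)
{
    uint32_t be = htonl(host_value);
    memcpy(dst, &be, sizeof be);   /* memcpy also tolerates unaligned offsets */
}

static uint32_t load_u32_be(const void *src)
{
    uint32_t be;
    memcpy(&be, src, sizeof be);
    return ntohl(be);
}

/* Option 3: the writer stores this tag in host order at the start of the file.
 * The value is an arbitrary choice for this sketch. */
#define BYTE_ORDER_TAG 0x01020304u

static uint32_t swap_u32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Returns 0 if the file matches the host byte order, 1 if every multi-byte
 * field must be swapped, and -1 if the tag is not recognised. */
static int file_needs_swap(const void *header)
{
    uint32_t tag;
    memcpy(&tag, header, sizeof tag);
    if (tag == BYTE_ORDER_TAG)
        return 0;
    if (tag == swap_u32(BYTE_ORDER_TAG))
        return 1;
    return -1;
}
```

With option 1 every access pays for the conversion on little-endian hosts, but the file is identical everywhere. With option 3 the native case is free, and only a foreign file goes through `swap_u32`.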

Additionally, if you store things like length prefixes or file offsets, you may end up with a mixture of 32-bit and 64-bit pointers. A 32-bit platform can't create an mmap view larger than 4 GB, so it's unlikely that you would support file sizes larger than 4 GB there. Programs like rrdtool take this approach, and support much larger file sizes on 64-bit platforms. This means your binary file won't be compatible across platforms if you use the platform pointer size inside your file.
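
One way to keep pointer size out of the file is to store offsets and lengths in fixed-width types. The sketch below is only an illustration: `struct record_header`, its fields, and `offset_is_mappable` are hypothetical, and byte order still has to be handled separately.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk record header. Every field has the same width on
 * 32-bit and 64-bit builds because none of them is a size_t or a pointer. */
struct record_header {
    uint64_t next_offset;  /* file offset of the next record, always 8 bytes */
    uint32_t length;       /* payload length in bytes */
    uint32_t flags;
};

/* Fail the build if a compiler ever lays this struct out differently. */
_Static_assert(sizeof(struct record_header) == 16,
               "record_header must be 16 bytes on every platform");

/* A 32-bit process can still read the header, but it has to reject offsets
 * that do not fit in its address space instead of truncating them. */
static int offset_is_mappable(uint64_t offset)
{
    return offset <= (uint64_t)SIZE_MAX;
}
```

A 64-bit build accepts any offset here, while a 32-bit build rejects offsets above 4 GB explicitly, so the file layout stays the same on both.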

My recommendation is to ignore all byte order issues up front, and design the system to run fast on your platform. If/when you need to move your data to another platform, then choose the easiest/quickest/most appropriate method of doing so. If you start out by trying to create a platform independent data format, you will generally make mistakes, and have to go back and fix those mistakes later. This is especially problematic when 99% of the data is in the correct byte order, and 1% of it is wrong. This means fixing bugs in your data translation code will break existing clients on all platforms.

You'll want to have a multi-platform test setup before writing code to support more than one platform.

brianegge
We have similar issues except that we decided that Intel byte ordering was the most natural way to store data: almost all our customers run Linux (Intel) servers or Windows (Intel of course) servers. Big endian is going out of fashion.
Tim Cooper