tags:
views: 534
answers: 8

This isn't a question specific to any programming language. Say you have some file written on a big-endian machine, and you know this. If two single-byte values were written back-to-back, how would you know? Big-endian reverses the byte order of 16-, 32-, and 64-bit values, so how would you know you need to read it as individual bytes?

For instance, you write the byte 0x11, then the byte 0x22. The file then contains 0x1122. If you read that on a little-endian machine, you'd have to convert it. So would you read it as 0x2211, or 0x1122? And how would you know which?

Does this make any sense? I feel like I'm missing something super basic here.

+2  A: 

You need to either divine it because you know something else (i.e., you know you are reading a file in big-endian format) or you need to encode the endianness in the file somehow. Unicode text files can start with the byte-order mark U+FEFF for exactly this purpose. If you read it as 0xFEFF, the file is in your native endian format. If you read it as 0xFFFE, it's not.
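A minimal C sketch of that BOM check (the function name and return convention here are my own, for illustration; they're not from any standard API):

```c
/* Inspect the first two bytes of a UTF-16 file for the byte-order mark
 * U+FEFF. Returns 1 for big-endian, 0 for little-endian, -1 if no BOM. */
int utf16_bom_endianness(const unsigned char b[2])
{
    if (b[0] == 0xFE && b[1] == 0xFF) return 1;  /* bytes in order: big-endian */
    if (b[0] == 0xFF && b[1] == 0xFE) return 0;  /* bytes swapped: little-endian */
    return -1;                                   /* no BOM: endianness unknown */
}
```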

MSN
I definitely know the endianness of the machine that wrote the file, and I definitely know the endianness of the machine that is attempting to read the file.
yodaj007
A: 

The processor operates in one or the other endian mode (some can switch based on pages, etc). They don't know if they're doing the right thing or not. They just do what they do. (Garbage In, Garbage Out) :-)

Brian Knoblauch
+1  A: 

You're exactly right...without some idea of the data you're looking at, there's no way to know.

That being said, there are often ways to guess...if you know you're supposed to be seeing text, you could run some simple tests to see if what you're getting is reasonable...if you can read a header out, you can often divine it from that...but if you're just looking at a stream of bytes, there's no surefire way to know.

Beska
A: 

There's no way to detect it, I'd say. But in C#, the BitConverter class has an IsLittleEndian property.

It all depends on how you want to interpret it.

Read more here.
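For comparison, the same runtime check can be sketched in C by storing a known multi-byte value and looking at which byte lands first in memory (helper name is mine, not a standard function):

```c
#include <stdint.h>

/* A C analogue of C#'s BitConverter.IsLittleEndian: if the low-order
 * byte of a 16-bit value comes first in memory, the host is little-endian. */
int host_is_little_endian(void)
{
    uint16_t probe = 0x0102;
    return *(unsigned char *)&probe == 0x02;
}
```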

Filip Ekberg
+1  A: 

Does this make any sense?

Yes: it's a problem.

I feel like I'm missing something super basic here.

Basically, to read a file (especially a binary file) you need to know the file format: that includes knowing whether a given pair of bytes is a sequence of two individual bytes or a single double-byte word.
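To make the asker's example concrete, here is a sketch in C of the same two bytes interpreted both ways; only the file format tells you which interpretation is right (helper names are illustrative):

```c
#include <stdint.h>

/* Interpret two raw bytes as a big-endian 16-bit word. */
uint16_t as_be16(const unsigned char b[2])
{
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Interpret the same two bytes as a little-endian 16-bit word. */
uint16_t as_le16(const unsigned char b[2])
{
    return (uint16_t)((b[1] << 8) | b[0]);
}
```

With the bytes 0x11 then 0x22 from the question, the big-endian reading gives 0x1122 and the little-endian reading gives 0x2211; as two individual bytes they're just 0x11 and 0x22.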

ChrisW
+5  A: 

There is no way to know. This is why formally specified file formats typically mandate an endianness, or they provide an option (as with Unicode, as MSN mentioned). This way, if you are reading a file with a particular format, you know it's big-endian already, because the fact that it's in that format implies a particular endianness.

Another good example of this is network byte order -- network protocols are typically big-endian, so if you're a little-endian processor talking to the internet, you have to write things backwards. If you're big-endian, you don't need to worry about it. People use functions like htonl and ntohl to preprocess things they write to the network so that their source code is the same on all machines. These functions are defined to do nothing on big-endian machines, but they flip bytes on little-endian machines.
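For example, here is a sketch of serializing a 32-bit value in network (big-endian) order with htonl/ntohl; the helper names are made up, but the byte sequence written is the same on any host, because htonl is a no-op on big-endian machines and a byte swap on little-endian ones:

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value into a buffer in network byte order. */
void put_u32_network(unsigned char out[4], uint32_t value)
{
    uint32_t be = htonl(value);
    memcpy(out, &be, 4);      /* out[0] is now the most significant byte */
}

/* Read it back into host byte order. */
uint32_t get_u32_network(const unsigned char in[4])
{
    uint32_t be;
    memcpy(&be, in, 4);
    return ntohl(be);
}
```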

The key realization is that endianness is a property of how particular architectures represent words. It's not a mandate that they have to write files a certain way; it just tells you that the instructions on the architecture expect multi-byte words to have their bytes ordered a certain way. A big-endian machine can write the same byte sequence as a little-endian machine, it just might use a few more instructions to do it, because it has to reorder the bytes. The same is true for little-endian machines writing big-endian formats.
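For instance, either kind of machine can emit whatever byte order a format demands by building the output from shifts instead of relying on memory layout (a sketch; helper names are illustrative):

```c
#include <stdint.h>

/* Write 'value' big-endian, regardless of the host's own endianness. */
void store_be32(unsigned char out[4], uint32_t value)
{
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)(value);
}

/* Write 'value' little-endian, again regardless of host. */
void store_le32(unsigned char out[4], uint32_t value)
{
    out[0] = (unsigned char)(value);
    out[1] = (unsigned char)(value >> 8);
    out[2] = (unsigned char)(value >> 16);
    out[3] = (unsigned char)(value >> 24);
}
```

A little-endian host running store_be32 is exactly the "few more instructions to reorder the bytes" case described above.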

tgamblin
+1  A: 

You are not missing anything. Well defined binary file formats (such as Excel 97-2003 xls workbooks for example) must include the endianness as part of the specification or you will obviously have big problems.

Historically, the Macintosh used Motorola processors (the 68000 and its successors), which were big-endian, while IBM PC / DOS / Windows computers have always used Intel processors, which are little-endian. So software vendors with C / C++ code bases that run on both platforms are very familiar with this issue, while software vendors who have always developed Windows software, or Mac software before Apple switched to Intel, might have simply ignored it - at least for their own file formats.

Joe Erickson
Don't forget early Windows NT on Alpha and MIPS, and Apple having PowerPC in there in the middle.
crashmstr
Good point. I overlooked them because none of the projects I've worked on were ported to non-Intel Windows and I have been out of Mac development since before the PowerPC.
Joe Erickson
A: 

Not sure if this is exactly what you're asking, but, for example, the PCAP file format specifies a variable endianness:

http://www.winpcap.org/ntar/draft/PCAP-DumpFileFormat.html

The concept is that you can write a "marker" value (a multi-byte magic number), such as 0x12345678, to the header of your file. On a "big endian" machine such as a PowerPC, it will be written as follows:

0x12 0x34 0x56 0x78

On a "little endian" machine such as an x86, it will be written as follows:

0x78 0x56 0x34 0x12

Then, when reading the header, you can tell from the value your machine reads out whether you need to swap bytes while reading the rest of the file. Or you could mandate a single endianness, such as big endian; then you would always swap bytes on a little-endian machine.

In the case of the PCAP format, the variable endianness was allowed for performance reasons. But it's probably simpler to specify an endianness and stick to it.
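A C sketch of that marker check (using the example constant 0x12345678 from this answer; the real pcap magic number is 0xA1B2C3D4, and the function name is mine):

```c
#include <stdint.h>
#include <string.h>

#define MARKER_NATIVE  0x12345678u
#define MARKER_SWAPPED 0x78563412u  /* same constant with bytes reversed */

/* Read the header's first four bytes as a native-order word.
 * Returns 1 if the file matches host order, 0 if the reader must
 * byte-swap every field, -1 if the marker isn't recognized. */
int header_matches_host_order(const unsigned char hdr[4])
{
    uint32_t word;
    memcpy(&word, hdr, 4);               /* reinterpret in host byte order */
    if (word == MARKER_NATIVE)  return 1;
    if (word == MARKER_SWAPPED) return 0;
    return -1;
}
```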

Mike
PowerPC is usually big endian, and x86 is little endian.
ShaChris23