views: 665

answers: 11

An intern who works with me showed me an exam he had taken in computer science about endianness issues. There was a question that showed an ASCII string "My-Pizza", and the student had to show how that string would be represented in memory on a little endian computer. Of course, this sounds like a trick question because ASCII strings are not affected by endian issues.

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

I know this can't be right. There is no way an ASCII string would be represented like that on any machine. But apparently, the professor is insisting on this. So, I wrote up a small C program and told the intern to give it to his professor.

#include <string.h>
#include <stdio.h>

int main()
{
    const char* s = "My-Pizza";
    size_t length = strlen(s);
    for (const char* it = s; it < s + length; ++it) {
        printf("%p : %c\n", (void *)it, *it);  /* %p expects a void * */
    }
}

This clearly demonstrates that the string is stored as "My-Pizza" in memory. A day later, the intern gets back to me and tells me the professor is now claiming that C is automagically converting the addresses to display the string in proper order.

I told him his professor is insane, and this is clearly wrong. But just to check my own sanity here, I decided to post this on stackoverflow so I could get others to confirm what I'm saying.

So, I ask: who is right here?

+7  A: 

The professor is wrong if we're talking about a system that uses 8 bits per byte.

I often work with embedded systems that actually use 16-bit "bytes", each "byte" being little-endian. On such a system, the string "My-Pizza" would indeed be stored as "yMP-ziaz". And if the system uses 32 bits for a "byte", the string would be stored as "P-yMazzi".

But as long as it's an 8-bit-per-byte system, the string will always be stored as "My-Pizza" independent of the endian-ness of the higher-level architecture.

Dmitry Brant
What system is that?
Heath Hunnicutt
+1 Heath, I've done a lot of embedded work and never seen something weird like that.
Carl Norum
One product I've worked on uses a Texas Instruments DSP (2808, I think), whose smallest addressable unit of memory is 16 bits.
Dmitry Brant
Aha, all bets are off when it comes to DSP. How would you write the OP's program with only 16-bit addressing? Do you have to decompose the 16-bit chunks into 8-bit pieces yourself?
Carl Norum
Dmitry, that's cool about the DSP. Are you using a C compiler that has the "char *" type? What happens when you have char * p = "MyPi"; and perform "i=*p; p++; j=*p"? Can you individually address the bytes that are packed into the 16-bit words using a char * in C?
Heath Hunnicutt
A "char" in this compiler is actually 16 bits. So an ASCII string would be stored with each character taking up 16 bits, such as "M\0y\0-\0P\0...". So, in reality, what I wrote in my response would not happen in practice, at least for string literals. It does happen for long integers; i.e. 0x12345678 would be stored as 0x3412 0x7856.
Dmitry Brant
That seems more like what I would have expected for 16-bit minimum addressing.
Carl Norum
I've seen this too, for example, when programming on a Canon A620 digital camera using the CHDK hack. Not only is the pixel data 10 bits packed, but the data is accessed in a 16-bit little-endian format. So you have to read 2 chars, swap them, repeat a few times, and *then* unpack.
rlbond
+14  A: 

Without a doubt, you are correct.

ANSI C standard 6.1.4 specifies that string literals are stored in memory by "concatenating" the characters in the literal.

ANSI standard 6.3.6 also specifies the effect of addition on a pointer value:

When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression.

If the idea attributed to this person were correct, then the compiler would also have to monkey around with integer math when the integers are used as array indices. Many other fallacies would also result which are left to the imagination.

The person may be confused, because (unlike a string initializer), multi-byte character constants such as 'ABCD' are stored in endian order.

There are many reasons a person might be confused about this. As others have suggested here, he may be misreading what he sees in a debugger window, where the contents have been byte-swapped for readability of int values.

Heath Hunnicutt
It may be that the professor is looking at memory in his debugger in a 32-bit mode and is confused by the endianness?
Carl Norum
This is all just a communication gap, due to so few people having seen an actual dump, and the fact that no one here recognizes that you have to print one thousand as 1,000, not 000,1. This totally wrong answer has 8 votes from equally confused readers...
DigitalRoss
"Totally wrong"? Be more specific?
Heath Hunnicutt
+1  A: 

I assume the professor was trying to make a point by analogy about the endian/NUXI problem, and you're right when you apply it to actual strings. But don't let that distract from the fact that he was trying to teach students a point and how to think about a problem a certain way.

Dinah
Teaching someone a "point" by telling lies isn't teaching *anything*. That's **horrible**, don't let him get away with it.
Carl Norum
+11  A: 

The professor is confused. In order to see something like 'P-yM azzi' you need to use a memory inspection tool that displays memory in '4-byte integer' mode and at the same time gives you a "character interpretation" of each integer from higher-order byte to lower-order byte.

This, of course, has nothing to do with the string itself. And to say that the string itself is represented that way on a little-endian machine is utter nonsense.

AndreyT
OK, @AndreyT, I think I need your help on this one. As usual, you are right, but could it be: that's exactly what the prof meant? I have a feeling the SO crowd has lurched in the wrong direction on this one...
DigitalRoss
Hmm... Maybe, but what would be the "correct" answer in this case? If one inspects little-endian memory as a sequence of bytes, one would see 'My-Pizza' in there. If one interprets it as a sequence of 2-byte ints, it would be 'yM P- zi az'. In the case of 4-byte ints it's 'P-yM azzi'. And finally an 8-byte int interpretation would give 'azziP-yM'. All these "interpretations" are just that - interpretations, ways to *display* data in memory. All of them are "correct", once one understands where they come from. Nothing gives the professor the basis to *insist* on just one of them as the "right" one.
AndreyT
It makes very little sense for a debugger to say "This integer, if stored on a machine with different endianness, would represent this different string in memory".
caf
+7  A: 

You can quite easily prove that the compiler is doing no such "magic" transformations, by doing the printing in a function that doesn't know it's been passed a string:

#include <stdio.h>
#include <string.h>

void foo(const void *mem, size_t n)
{
    const char *cptr, *end;
    for (cptr = mem, end = cptr + n; cptr < end; cptr++)
        printf("%p : %c\n", (void *)cptr, *cptr);
}

int main(void)
{
    const char *s = "My-Pizza";

    foo(s, strlen(s));
    foo(s + 1, strlen(s) - 1);
}

Hell, you can even compile to assembly with gcc -S and conclusively determine the absence of magic.

caf
+1 for ASM. Also, you can write this routine IN assembly just to prove it.
Paul Nathan
+1 for assembly, I went back and linked to this answer from http://stackoverflow.com/questions/1565567/in-which-scenario-it-is-useful-to-use-disassembly-language-while-debugging/1565590#1565590
sharptooth
+1  A: 

You may be interested to know that it is possible to emulate a little-endian architecture on a big-endian machine, or vice versa. The compiler has to emit code which auto-magically messes with the least significant bits of char* pointers whenever it dereferences them: on a 32-bit machine you'd map 00 <-> 11 and 01 <-> 10.

So, if you write the number 0x01020304 on a big-endian machine, and read back the "first" byte of that with this address-munging, then you get the least significant byte, 0x04. The C implementation is little-endian even though the hardware is big-endian.

You need a similar trick for short accesses. Unaligned accesses (if supported) may not refer to adjacent bytes. You also can't use native stores for types bigger than a word because they'd appear word-swapped when read back one byte at a time.

Obviously, however, little-endian machines do not do this all the time; it's a very specialist requirement, and it prevents you from using the native ABI. Sounds to me as though the professor thinks of actual numbers as being "in fact" big-endian, and is deeply confused about what a little-endian architecture really is and/or how its memory is being represented.

It's true that the string is "represented as" P-yM azzi on 32bit l-e machines, but only if by "represented" you mean "reading the words of the representation in order of increasing address, but printing the bytes of each word big-endian". As others have said, this is what some debugger memory views might do, so it is indeed a representation of the contents of the memory. But if you're going to represent the individual bytes, then it is more usual to list them in order of increasing address, no matter whether words are stored b-e or l-e, rather than represent each word as a multi-char literal. Certainly there is no pointer-fiddling going on, and if the professor's chosen representation has led him to think that there is some, then it has misled him.

Steve Jessop
What!? Name me one such compiler that emits this automagic code to munge the bottom two bits of every pointer access everywhere.
Adam Rosenfield
I have specialized library functions for doing this, for the 1-in-10-million case where this is actually correct.
Joshua
@Adam: not strictly the compiler, but the so-called "translator", which you can consider like a compiler back-end, for Tao Group's now sadly defunct "intent". The intent environment was always little-endian, even on big-endian hardware. This made implementing network drivers a little confusing, since intent code had one endianness, and inline native assembler the opposite. And as I specifically stated, it did not munge every pointer access, it only munged non word-size pointer access. Made it easier for writers of portable apps to test, because they didn't need a b-e platform to hand.
Steve Jessop
The more important goal, though, was that intent had a virtual assembler language and byte code, which in order to be portable needed to have a consistent endian-ness, consistent sizes of builtin types, etc. It was then up to the translator to make this work on a given platform.
Steve Jessop
A: 

Also (and I haven't played with this in a long time, so I might be wrong), he might be thinking of Pascal, where strings are represented as "packed arrays" which, IIRC, are characters packed into 4-byte integers?

Brian Postow
+1  A: 

It's hard to read the prof's mind, and certainly the compiler is doing nothing other than storing bytes at adjacent increasing addresses on both BE and LE systems. But it is normal to display memory in word-sized numbers, for whatever the word size is, and we write one thousand as 1,000, not 000,1.

$ cat > /tmp/pizza
My-Pizza^D
$ od -X /tmp/pizza
0000000 502d794d 617a7a69
0000010
$

For the record, y == 0x79 and M == 0x4d.
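For comparison, a bytewise dump of the same file (GNU od shown; exact column layout may vary) lists the characters in increasing-address order:

```shell
$ od -c /tmp/pizza
0000000   M   y   -   P   i   z   z   a
0000010
```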

DigitalRoss
Actually, such a format is pretty standard. A 32-bit dump with ASCII alongside in my ARM debugger shows me the 32-bit words in the right (logical) order, but the ASCII dump is in bytewise order.
Carl Norum
Agreed, but also kinda my point. You needed two dumps with opposite paradigms, printed side-by-side.
DigitalRoss
I should add that of course I can't read the prof's mind. But I'm a bit shocked that an interpretation that made the prof's point perfectly valid didn't seem to occur to a lot of people.
DigitalRoss
Probably because it's utterly ridiculous to use a ten-mile-long confusing explanation to justify a statement that is still completely wrong. The question was whether the bytes are in memory in that order, and they're not. The fact that they will appear backwards if you go out of your way to print them backwards proves nothing.
hobbs
No, this idea occurred to Carl Norum 5 hours before your post. The OP made a specific statement with: "A day later, the intern gets back to me and tells me the professor is now claiming that C is automagically converting the addresses to display the string in proper order." The OP seems to have faith in the intern who is passing the message for him, but that could surely be the problem. Also, the OP wants to know what is correct, and he seems to want some references. I agree with your psychoanalysis that this likely stemmed from a miscommunication, but does that answer the OP's question?
Heath Hunnicutt
When I'm saying that the professor is confused, I mean that he's wrong to *insist* on one and only one representation method as *The Only True One*, while, as you yourself said above, both are right. Moreover, there are more ways to interpret the memory contents in this case. Now, as an additional note, when one's talking about strings (sequences of bytes), trying to push a 4-byte int memory view as the only appropriate way to inspect the memory is what I'd call "unorthodox".
AndreyT
Frankly it doesn't matter whether the Prof understand endian-ness or not. The fact that his student has come away, having asked a specific question, with the impression that C is automagically converting addresses, means that (at least one of) the Prof's students don't understand endian-ness. Whoever first said the words "converting addresses" is in the wrong here, because that is what is wrong. Arguing over how little-endian memory should be represented is one thing, and sure both people can be right. Thinking that any addresses are being reversed by C is factually incorrect.
Steve Jessop
Look, assuming the intern I'm speaking with is giving me the facts accurately, the professor is simply wrong. Some here have argued that the professor is correct "from a certain point of view", i.e. the string can be "represented" as "P-yM azzi" if you use a debugger and interpret the memory as a 32-bit integer. Granted, this is true, but this is totally misleading and has no bearing on how the string is ACTUALLY stored in memory. And certainly, it is totally false that the C language does any kind of address "remapping" under the hood to compensate for endianness.
Charles Salvia
You're incorrect that this representation has no bearing on how strings are actually stored in memory. It describes the contents of the memory by a 1-1 mapping. If the Prof has said that for his course, this is how memory contents will be represented, then that's how the string in question "is represented". However, he's failed to explain what's actually going on, which is presumably a fault in the lessons. He's also wrong if he thinks that's the only way to describe memory, just as you'd be wrong to say lower-address bytes are ACTUALLY on the left.
Steve Jessop
Did you really need to post this as an answer AND as a comment?
mrduclaw
+1 (from zero) for a simple and simply expressed explanation of what the Prof was no doubt trying to say. Not sure why this was so controversial.
Bill Forster
A: 

Does the professor's "C" code look anything like this? If so, he needs to update his compiler.

main() {
    extrn putchar;
    putchar('Hell');
    putchar('o, W');
    putchar('orld');
    putchar('!*n');
}
jbcreix
A: 

AFAIK, endianness only makes sense when you want to break a large value into small ones. Therefore I don't think that C-style strings are affected by it. They are, after all, just arrays of characters. When you are reading only one byte, how could it matter whether you read it from left or right?

hab
+2  A: 

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

It would be represented as... represented as what? Represented to the user as a 32-bit integer dump? Or represented/laid out in the computer's memory as "P-yM azzi"?

If the professor said "My-Pizza" would be represented/laid out as "P-yM azzi" in the computer's memory because the computer is of little-endian architecture, then somebody, please, has got to teach that professor how to use a debugger! I think that's where all the professor's confusion stems from. I have an inkling that the professor is not a coder (not that I'm looking down on the professor); I think he doesn't have a way to prove in code what he has learned about endian-ness.

Maybe the professor learned the endian-ness stuff just about a week ago, then used a debugger incorrectly, got quickly delighted about his newly unique insight into computers, and then preached it to his students immediately.

If the professor said the endian-ness of a machine has a bearing on how ASCII strings are represented in memory, he needs to clean up his act; somebody should correct him.

If the professor had instead given an example of how integers are represented/laid out differently depending on the machine's endianness, his students could appreciate what he is teaching.

Michael Buen