tags:

views:

55

answers:

1

I was wondering why both UTF-16LE and UTF-16BE exist. Is it considered "inefficient" for a big-endian environment to process little-endian data?

Currently, this is what I use to store a 2-byte var locally:

  unsigned char octets[2];
  short int shortint = 12345; /* (assuming short int = 2 bytes) */
  octets[0] = shortint & 255;
  octets[1] = (shortint >> 8) & 255;

I know that when storing and reading with a fixed endianness locally, there is no endianness risk. I was wondering if it's considered "inefficient"? What would be the most "efficient" way to store a 2-byte var? (While restricting the data to the environment's endianness, local use only.)

Thanks, Doori Bar

+2  A: 

This allows code to write large amounts of Unicode data to a file without conversion. During loading, you must always check the endianness. If you're lucky, you need no conversion. So in 66% of the cases you need no conversion, and only in 33% must you convert.

In memory, you can then access the data using the native datatypes of your CPU which allows for efficient processing.

That way, everyone can be as happy as possible.

So in your case, you need to check the encoding when loading the data but in RAM, you can use an array of short int to process it.

[EDIT] The fastest way to convert a 16bit value to 2 octets is:

char octet[2];
short * ptr = (short*)&octet[0];
*ptr = 12345;

Now you don't know whether octet[0] holds the low or the upper 8 bits. To find out, write a known value and then examine the bytes.

This will give you one of the two encodings: the native one of your CPU.

If you need the other encoding, you can either swap the octets as you write them to a file (i.e. write octet[1] first, then octet[0]) or swap them in your code.

If you have several octets, you can use 32bit integers to swap two 16bit values at once:

char octet[4];
short * ptr = (short*)&octet[0];
*ptr++ = 12345;
*ptr++ = 23456;

int * ptr32 = (int*)&octet[0];
int val = ((*ptr32 << 8) & 0xff00ff00) | ((*ptr32 >> 8) & 0x00ff00ff);
Aaron Digulla
Thanks for the fast response. Any chance you can show me a basic sample of how to convert a 2-byte var to 2 octets, natively? (while ignoring endianness, for local use only)
Doori Bar
Correct me if I'm wrong - but according to your answer I assumed that my code was indeed inefficient. (for local use only)
Doori Bar
Your code is inefficient when you use it to write the Unicode data to a file (unless you must use UTF-16LE as the encoding).
Aaron Digulla
This is the efficient way to do it? http://codepad.org/4lESCv0G , or I got it all wrong?
Doori Bar
Misunderstanding :-) Your code is efficient if you need to convert 16bit native Unicode -> UTF-16LE. I'm saying that you should try to avoid the conversion.
Aaron Digulla
But for the code on codepad: Turn `octets` into a `char *` pointer, assign it the address of `shortint` and then access the values directly with `octet[0/1]`.
Aaron Digulla
No need to convert anything ... it's a 100% native operation for local use only :) - or is my paste at http://codepad.org/4lESCv0G still converting?
Doori Bar
I see, thanks a lot! I think you made things clear for me now.
Doori Bar
Yes, because you copy data around in memory. My solution just uses clever pointer arithmetic. See http://codepad.org/dBQ0WSaw
Aaron Digulla