tags:

views:

55

answers:

1

I was wondering why both UTF-16LE and UTF-16BE exist. Is it considered "inefficient" for a big-endian environment to process little-endian data?

Currently, this is what I use to store a 2-byte var locally:

  unsigned char octets[2];
  short int shortint = 12345; /* (assuming short int = 2 bytes) */
  octets[0] = shortint & 255;
  octets[1] = (shortint >> 8) & 255;

I know that when storing and reading with a fixed endianness locally, there is no endianness risk. I was wondering if it's considered "inefficient"? What would be the most "efficient" way to store a 2-byte var? (While restricting the data to the environment's endianness, local use only.)

Thanks, Doori Bar

+2  A: 

This allows code to write large amounts of Unicode data to a file without conversion. During loading, you must always check the endianness. If you're lucky, you need no conversion. So in 66% of the cases you need no conversion, and only in 33% must you convert.

In memory, you can then access the data using the native datatypes of your CPU which allows for efficient processing.

That way, everyone can be as happy as possible.

So in your case, you need to check the encoding when loading the data but in RAM, you can use an array of short int to process it.

[EDIT] The fastest way to convert a 16bit value to 2 octets is:

char octet[2];
short * ptr = (short*)&octet[0];
*ptr = 12345;

Now you don't know whether octet[0] holds the low or the upper 8 bits. To find out, write a known value and then examine the bytes.

This will give you one of the two encodings: the native one of your CPU.

If you need the other encoding, you can either swap the octets as you write them to a file (i.e. write octet[1] first, then octet[0]) or swap them in your code.

If you have several octets, you can use 32bit integers to swap two 16bit values at once:

char octet[4];
short * ptr = (short*)&octet[0];
*ptr++ = 12345;
*ptr++ = 23456;

int * ptr32 = (int*)&octet[0];
int val = ((*ptr32 << 8) & 0xff00ff00) | ((*ptr32 >> 8) & 0x00ff00ff);
Aaron Digulla
Thanks for the fast response. Any chance you can show me a basic sample of how to convert a 2-byte var to 2 octets, natively? (while ignoring endianness, for local use only)
Doori Bar
Correct me if I'm wrong - but according to your answer I assumed that my code was indeed inefficient. (for local use only)
Doori Bar
Your code is inefficient when you use it to write the Unicode data to a file (unless you must use UTF-16LE as the encoding).
Aaron Digulla
This is the efficient way to do it? http://codepad.org/4lESCv0G , or I got it all wrong?
Doori Bar
Misunderstanding :-) Your code is efficient if you need to convert 16bit native Unicode -> UTF-16LE. I'm saying that you should try to avoid the conversion.
Aaron Digulla
But for the code on codepad: Turn `octets` into a `char *` pointer, assign it the address of `shortint` and then access the values directly with `octet[0/1]`.
Aaron Digulla
No need to convert anything ... it's a 100% native operation for local use only :) - or is my paste at http://codepad.org/4lESCv0G still converting?
Doori Bar
I see, thanks a lot! I think you made things clear for me now.
Doori Bar
Yes, because you copy data around in memory. My solution just uses clever pointer arithmetic. See http://codepad.org/dBQ0WSaw
Aaron Digulla