tags:
views: 90
answers: 4

Hi,

I need to implement a character encoding conversion function in C++ or C (C is most desired) from a custom encoding scheme (one designed to support multiple languages in a single encoding) to UTF-8.

Our encoding is pretty random; it looks like this:

Because of the randomness of this mapping, I am thinking of using std::map to map our encoding to UTF-8 and vice versa, in two separate maps, and using these maps for the conversion. Is there any more optimized data structure or way to do it?

+2  A: 

If your code points are contiguous, just make a big char * array and translate using that. I don't really understand what you mean by a UTF-8 code point: UTF-8 has representations, and Unicode has code points. If you want code points, use an array of ints.

const int mycode_to_unicode[] = {
    0x00ff,
    0x0102,
    // etc.
};

You could put a value like -1 where there are holes in your encoding, to catch errors.

Going the other way is just a matter of making an array of structs of the same size, each something like

struct {
   int mycode;
   int unicode;
};

copying the keys of the array into mycode and the values into unicode, running it through qsort with a function that compares the unicode values, and then using bsearch with the same function to go from code point to your encoding.

This is assuming you want to use C.

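A rough sketch of that qsort/bsearch reverse lookup, assuming names and table contents along the lines above (the entries here are only illustrative, not the questioner's real mapping):

#include <stdio.h>
#include <stdlib.h>

struct pair { int mycode; int unicode; };

/* reverse table: one entry per code in mycode_to_unicode[], to be sorted by unicode */
static struct pair unicode_to_mycode[] = {
    { 0, 0x00ff },
    { 1, 0x0102 },
    /* etc. */
};
#define NPAIRS (sizeof unicode_to_mycode / sizeof unicode_to_mycode[0])

/* compare two entries by their unicode field, for both qsort and bsearch */
static int cmp_unicode(const void *a, const void *b)
{
    const struct pair *pa = a, *pb = b;
    return (pa->unicode > pb->unicode) - (pa->unicode < pb->unicode);
}

int main(void)
{
    struct pair key, *found;

    qsort(unicode_to_mycode, NPAIRS, sizeof unicode_to_mycode[0], cmp_unicode);

    key.unicode = 0x0102;                  /* code point to translate back */
    found = bsearch(&key, unicode_to_mycode, NPAIRS,
                    sizeof unicode_to_mycode[0], cmp_unicode);
    if (found)
        printf("U+%04X -> custom code %d\n", (unsigned)key.unicode, found->mycode);
    return 0;
}
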
Kinopiko
+1  A: 

Since you build the constant mapping up front and use it only for lookups, a hash table might be a better fit than std::map. There is no hash table implementation in the C++ standard, but many free implementations are available, both in C and C++.

These are C implementations:

http://www.cl.cam.ac.uk/~cwc22/hashtable/

http://wiki.portugal-a-programar.org/c:snippet:hash_table_c

Glibc hash tables.

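As a rough sketch of the idea only (this is not code from any of the libraries above, and the table size and entries are made up), a small fixed-size open-addressing hash table for code-point-to-code-point lookups could look like this:

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024                 /* power of two, comfortably larger than the key count */

struct entry { uint32_t key; uint32_t value; int used; };
static struct entry table[TABLE_SIZE];

/* insert a mapping; assumes the table never becomes full */
static void put(uint32_t key, uint32_t value)
{
    unsigned i = (key * 2654435761u) & (TABLE_SIZE - 1);   /* multiplicative hash */
    while (table[i].used)                                   /* linear probing */
        i = (i + 1) & (TABLE_SIZE - 1);
    table[i].key = key;
    table[i].value = value;
    table[i].used = 1;
}

/* look up a mapping; returns 1 and sets *value if found, 0 otherwise */
static int get(uint32_t key, uint32_t *value)
{
    unsigned i = (key * 2654435761u) & (TABLE_SIZE - 1);
    while (table[i].used) {
        if (table[i].key == key) { *value = table[i].value; return 1; }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}

int main(void)
{
    uint32_t u;
    put(0x0101, 0x0105);                 /* made-up mapping entries */
    put(0x0102, 0x0119);
    if (get(0x0102, &u))
        printf("maps to U+%04X\n", (unsigned)u);
    return 0;
}
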
Vijay Mathew
+1  A: 

Not sure if I understand the question, but if it's a 1:1 mapping that isn't too big, using a pre-initialized struct array may be the way to go (depending on the code, you could write a program that emits the contents of the init table once):

struct MAP { int from, to; };

struct MAP somemapping[MAXMAP] = {
    { 0x101, 0x01 },
    { 0x102, 0x02 },
    /* ... */
};

Using bsearch() would be a reasonably quick way to do lookups (after sorting the array on the from field with qsort()).

If the code is extremely performance sensitive, you could build an index-based lookup table:

int lookup[65536];

/* build the lookup table once at startup */
void init(void)
{
    for (int i = 0; i < MAXMAP; i++) {
        lookup[somemapping[i].from] = somemapping[i].to;
    }
}

void foo(void)
{
    /* ... */
    /* quick lookup */
    to = lookup[from];
    /* ... */
}
Nicholaz
+2  A: 

A hash table would surely be the fastest solution.

If the table is known up front and never changes (as I understand is the case here), you can determine a perfect hash for it, meaning that you will have no collisions and guaranteed constant retrieval time (at the expense of possibly some space).

I've used gperf a couple of times, but I suggest you check Bob Jenkins's great page on hashing (and minimal perfect hashing as well).

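As a toy sketch of the minimal-perfect-hash idea (the key set and target bytes below are hypothetical, and in practice you would generate the function offline with gperf or a similar tool rather than searching for it at run time):

#include <stdio.h>
#include <stdint.h>

#define NKEYS 5

/* hypothetical custom-encoding keys and the single byte each one maps to */
static const uint32_t keys[NKEYS]   = { 0x0101, 0x0142, 0x0219, 0x0105, 0x017C };
static const uint8_t  values[NKEYS] = { 0xA1,   0xA2,   0xA3,   0xA4,   0xA5 };

static uint32_t multiplier;            /* found by the search in main() */

static unsigned hash(uint32_t x)       /* maps a key into [0, NKEYS) */
{
    return (unsigned)((x * multiplier) >> 16) % NKEYS;
}

int main(void)
{
    uint8_t slot[NKEYS];               /* the minimal table: one byte per key */

    /* brute-force search for a multiplier that makes hash() collision-free
       on the key set; for tiny sets like this one it succeeds quickly */
    for (multiplier = 1; multiplier < 100000; multiplier++) {
        unsigned seen = 0, i;
        for (i = 0; i < NKEYS; i++) {
            unsigned h = hash(keys[i]);
            if (seen & (1u << h))
                break;                 /* collision: try the next multiplier */
            seen |= 1u << h;
        }
        if (i == NKEYS)
            break;                     /* every key landed in its own slot */
    }
    if (multiplier == 100000) {
        puts("no perfect hash found for this key set");
        return 1;
    }

    /* fill the minimal table and do one lookup */
    for (unsigned i = 0; i < NKEYS; i++)
        slot[hash(keys[i])] = values[i];
    printf("0x0142 -> 0x%02X\n", (unsigned)slot[hash(0x0142)]);
    return 0;
}
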
Remo.D
If the data are all integers, the fastest way is a straight lookup table without any hashing. This involves some wasted memory though.
Kinopiko
Unicode code points are all integers, but their UTF-8 encodings aren't.
MSalters
If you look at the picture linked to in the question, the things the questioner describes as UTF-8 code points are actually Unicode code points, which aren't valid UTF-8. Even so, the fastest way to look up UTF-8 would surely be to treat each one-to-four-byte sequence as an integer.
Kinopiko
The picture suggests that he's converting between a 16-bit code and a single-byte code (which direction is irrelevant here). If this is true, a complete lookup table would require 65536 bytes, while a minimal perfect hash would require 256 bytes. Even counting the code for the hash function, it should save space. Of course, the other direction can simply be an array of 256 uint16_t values.
Remo.D
But reading the text, I understand that he wants to go from an arbitrary UTF-8 sequence (mapping to some Unicode point) to a single-byte encoding. This means he has to deal with the full Unicode set which, if I'm not mistaken, is now defined up to U+10FFFF.
Remo.D
I think the hash representation consumes more memory than a simple lookup table (one for UTF-8 to our encoding and another for the reverse)?
siri
Have a look at minimal perfect hashing. If you have n keys, you know them in advance, and they don't change, it is possible to determine a function h(x) that returns an integer from 0 to n-1 such that h(k1) = h(k2) <=> k1 = k2. In your case you will only need an array of 256 bytes. Of course you will also have to link the hash function, but you will need some lookup code anyway. Minimal perfect hashing is normally used by compilers to look up the language keywords. I've added the Wikipedia link for an easier check.
Remo.D