views:

354

answers:

5

I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send 'special' characters like accented characters, euro signs, ...

I am guessing they copied an existing code page and made some changes, but I have no idea how to figure out what code page is closest to the one in my documentation.

In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and find the ones that have this character on that position, it would be a piece of cake.

However, all I can find on the internet are links to code page dumps just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up what code page one is looking at ?

+1  A: 

There is no way to auto-detect the codepage without additional information. Below the display layer it’s just bytes and all bytes are created equal. There’s no way to say “I’m a 0x41 from this and that codepage”, there’s only “I’m 0x41. Display me!”

Bombe
A: 

In most codepages, 0x41 is just the normal "A", I don't think any standard codepages have "Á" in that position. It could have a control character somewhere before the A that added the accent, or uses a non-standard codepage.

I don't see any use in knowing the "closest codepage", you just need to use the docs you got with the device.

Your last sentence is puzzling, what do you mean by "possible to look up what code page one is looking at"?

If you include your whole codepage, people here on SO could be more helpful and give you more insight about this issue, having one data point 0x41=Á doesn't help much.

Osama ALASSIRY
The use in 'closest codepage' is that in some programming languages, like python, it is easy to register a new code page that derives from an existing one. My specific case is not interesting to SO, the generic case of 'how to look up a code page that has a given character in a given position' is.
Thomas Vander Stichele
+3  A: 

If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?

In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).

If that's the case, I would expect to find a flag you can set to tell it whether the next bunch of codepoints should be converted using the "upper" or "lower" encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come with for the situation you describe.

Alan Moore
+1  A: 

What endian is the system? Perhaps you're flipping bit orders?

Statement
A: 

Somewhat random idea, but if you can get replicate a significant amount of the text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.

pat