ansaurus

Question

Enhancing an ASCII protocol with multilingual fields

Answer 1

+2 A:

Read up on the concept of ASCII Compatible Encoding, or ACE. iDNS is an example. So is/was UTF-7.

Here's the master speaking.

You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.
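The last point, that UTF-8 includes ASCII but not Latin-1, can be checked with a short sketch. The sample character and the helper name `is_valid_utf8` are chosen only for illustration.

```python
# Compare how one non-ASCII character is encoded in Latin-1 and in UTF-8.
text = "é"  # U+00E9

latin1_bytes = text.encode("latin-1")  # one byte: 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, utf8_bytes)
print(is_valid_utf8(latin1_bytes))    # Latin-1 bytes are not valid UTF-8 here
print(is_valid_utf8(b"plain ASCII"))  # pure ASCII always is
```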

bmargulies 2009-11-19 21:12:35

Switching *within* the string would be a bad idea, but I see no reason why he shouldn't be able to effectively have two different kinds of literal - a UTF-8-encoded-string and an ASCII-encoded-string.

Jon Skeet 2009-11-19 21:25:01

Answer 2

+2 A:

Do you have control over both the server and the client? If not, you can't change the protocol, so you won't be able to do it. When you say you're "not willing to rewrite the protocol" - you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.

I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
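A quick way to convince yourself of the backward compatibility claim is to encode an all-ASCII message both ways and compare the bytes. This is only a sketch; the message text is a made-up example, not taken from any real protocol specification.

```python
# A hypothetical all-ASCII protocol line, terminated by CR LF.
message = '230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"\r\n'

ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes.
same_bytes = ascii_bytes == utf8_bytes

# So bytes written by an old ASCII-only sender decode unchanged as UTF-8.
decoded = ascii_bytes.decode("utf-8")

print(same_bytes, decoded == message)
```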

You could introduce a prefix for UTF-8-encoded strings, e.g. U:

230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>

Would that help?
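A receiver for such prefixed literals might look roughly like the sketch below. The grammar is an assumption: an optional `U` immediately before a double-quoted literal, with no embedded quotes or escape sequences. The name `decode_literals` is hypothetical.

```python
import re

# Assumed grammar: optional U prefix, then a double-quoted literal
# that contains no embedded double quotes.
_LITERAL = re.compile(rb'(U?)"([^"]*)"')


def decode_literals(line):
    """Decode every quoted literal in a raw protocol line.

    U-prefixed literals are decoded as UTF-8, all others as strict ASCII,
    so each literal is treated as wholly one encoding or wholly the other.
    """
    results = []
    for prefix, body in _LITERAL.findall(line):
        encoding = "utf-8" if prefix == b"U" else "ascii"
        results.append(body.decode(encoding))
    return results


new_style = b'200 OUTPUT 1 NAME U"Photo \xc3\xa9 Black"\r\n'
old_style = b'200 OUTPUT 1 NAME "Photo Black"\r\n'
print(decode_literals(new_style))
print(decode_literals(old_style))
```

With strict ASCII decoding for unprefixed literals, a stray high byte in an old-style string raises `UnicodeDecodeError` instead of being silently misread.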

Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
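If the path does turn out to be 7-bit only, an ASCII-compatible encoding keeps every byte below 0x80. Here is a minimal sketch using the `punycode` codec from Python's standard library; the field value is invented for the example.

```python
# Hypothetical field value containing one non-ASCII character.
field = "Photo Noir é"

# Punycode represents the whole string using ASCII bytes only.
wire = field.encode("punycode")

# Every byte fits in 7 bits, so a path that strips the top bit cannot damage it.
survives_7bit = all(b < 0x80 for b in wire)

# The receiver reverses the transformation to recover the original text.
restored = wire.decode("punycode")

print(wire, survives_7bit, restored == field)
```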

Jon Skeet 2009-11-19 21:16:06

This is, IMHO, a really bad idea. Either all the code that touches this is 8-bit clean, or it's not. If it's not, then putting that U in there won't help. If it is, then you're still byte-picking in the middle of strings. What if some other program splits it in the middle?

bmargulies 2009-11-19 21:17:53

The "U" would be an indicator to expect UTF-8 within the string. I agree that if anything's going to trample on the top bit, there'll be problems - just as I said in my final paragraph. Why shouldn't he be able to treat a U-prefixed string differently to a non-U-prefixed string? I'm not suggesting that he switches *within* the string, but that he treats either the whole of the string's data as UTF-8, or the whole of it as ASCII. Finding the end of the string shouldn't be a problem.

Jon Skeet 2009-11-19 21:24:16

Yes, I have full control at both ends, although there is another developer at the other end who has his software written in FoxPro. I'll be writing a new client as an ActiveX control which he will "eventually" use but that we'll be redistributing to other vendors. I guess I'm just trying to save time. I also like the ability to send messages to the server with telnet - going full Unicode will break that.

Matt H 2009-11-19 22:20:03

The question here is the definition of 'within'. The overall message looks like a string to me. If the OP can really guarantee complete control of the interpretation, OK.

bmargulies 2009-11-19 22:25:30

@Matthew: If your test messages using telnet only require ASCII, then you can go UTF-8 for everything without any issues - there'd be no difference at all between an all-ASCII message and the same message in UTF-8, because UTF-8 was designed to be ASCII compatible.

Jon Skeet 2009-11-19 22:27:46

@Jon: thanks... If that is true then I'll simply do that, as that will maintain complete backward compatibility, right?

Matt H 2009-11-19 22:52:01

@Matthew Hook: Yes, any existing valid messages will be valid and mean the same thing when treated as UTF-8. (The reverse isn't true of course - UTF-8 messages containing non-ASCII characters would be rejected or misinterpreted by an "old" other end.) Going UTF-8 everywhere will certainly be a simpler solution :)
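The "rejected or misinterpreted" case can be illustrated with two hypothetical old receivers: one that decodes strictly as ASCII and one that assumes Latin-1. Both function names and the message are made up for this sketch.

```python
# A new-style message containing one non-ASCII character, sent as UTF-8.
new_message = 'NAME "Photo é"\r\n'.encode("utf-8")


def old_strict_ascii_reader(data):
    """Hypothetical old receiver that rejects anything outside ASCII."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None  # message rejected


def old_latin1_reader(data):
    """Hypothetical old receiver that treats each byte as one Latin-1 character."""
    return data.decode("latin-1")


print(old_strict_ascii_reader(new_message))  # rejected
print(old_latin1_reader(new_message))        # accepted, but the text is garbled
```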

Jon Skeet 2009-11-19 23:37:24

Answer 3

+1 A:

Why don't you want the field to be "always interpreted as UTF-8"? You don't say.

If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.

Jonathan Feinberg 2009-11-19 21:16:55
