I am enhancing a piece of software that implements a simple ASCII based protocol.

The protocol is simple... here is an example of what the messages look a little bit like (not the same though, I can't show you the real protocol):

AUTH 1 1 200<CR><LF>

To which we get a response looking similar to

230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>

The name "Photo Black" comes from an SQLite database. I need to enhance it to support foreign languages. So I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this so that the client application can interpret the string in the quotes and straight away recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol, that would be too much work. I just want to slip in some kind of encoding so clients can recognize some Spanish or Swedish names.

I don't want the field to be always interpreted as UTF-8 either, long story there. You know how in C++ I can type 0xFF and the compiler knows that this is a hex string... is there an equivalent for UTF-8? Sorry I may be jumping the gun but I'm not that familiar with UTF-8 encoding and internationalization in general.

+2  A: 

Read up on the concept of ASCII Compatible Encoding, or ACE. IDN (internationalized domain names) is an example. So is/was UTF-7.

Here's the master speaking.

You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.

bmargulies
Switching *within* the string would be a bad idea, but I see no reason why he shouldn't be able to effectively have two different kinds of literal - a UTF-8-encoded-string and an ASCII-encoded-string.
Jon Skeet
+2  A: 

Do you have control over both the server and the client? If not, you can't change the protocol so you won't be able to do it. When you say you're "not willing to rewrite the protocol" - you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.

I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
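To make the backward-compatibility point concrete, here is a minimal Python sketch. The message is the example response from the question, not the real protocol, and the variable names are only for illustration.

```python
# Example response line from the question (not the real protocol).
message = '230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"\r\n'

ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")

# For pure-ASCII text, both encodings produce exactly the same bytes.
same_bytes = ascii_bytes == utf8_bytes

# So a UTF-8 decoder reads an old ASCII-only message unchanged.
round_trip = ascii_bytes.decode("utf-8") == message
```

Both `same_bytes` and `round_trip` come out true, which is why existing all-ASCII traffic would not need to change.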

You could introduce a prefix for UTF-8-encoded strings, e.g. U:

230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>

would that help?
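A rough sketch of how a client might handle such a prefix, in Python. The grammar here is an assumption: `NAME "..."` is treated as ASCII, `NAME U"..."` as UTF-8, and quotes inside the name are assumed not to occur. `parse_name` is a hypothetical helper, not part of any existing client.

```python
import re

# Assumed grammar: NAME "..." is ASCII, NAME U"..." is UTF-8.
# Scanning for the closing quote on raw bytes is safe because the byte
# 0x22 never appears inside a multi-byte UTF-8 sequence.
NAME_FIELD = re.compile(rb'NAME (U?)"([^"]*)"')

def parse_name(line: bytes) -> str:
    """Return the decoded NAME value from one raw protocol line."""
    match = NAME_FIELD.search(line)
    if match is None:
        raise ValueError("no NAME field in line")
    prefix, raw = match.groups()
    # The whole string is one encoding or the other, never a mix.
    encoding = "utf-8" if prefix == b"U" else "ascii"
    return raw.decode(encoding)
```

With this sketch an unprefixed string that contains non-ASCII bytes raises `UnicodeDecodeError`, so a malformed message is rejected rather than silently misread.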

Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
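If the path does turn out to be 7-bit only, a Punycode round trip might look like the sketch below. It uses the `punycode` codec from Python's standard library; the Swedish name is made up for illustration, and how the encoded form would be marked inside the message is left open.

```python
# Hypothetical non-ASCII name, chosen only for illustration.
name = "Foto Svart Blå"

# The standard-library "punycode" codec yields ASCII-only bytes.
wire = name.encode("punycode")

# Every byte fits in 7 bits, so a path that strips the top bit cannot damage it.
seven_bit_safe = all(byte < 0x80 for byte in wire)

# The receiver reverses the transformation to recover the original text.
restored = wire.decode("punycode")
```

The trade-off is that the bytes on the wire are no longer human-readable for non-ASCII names, which matters if messages are typed or inspected by hand.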

Jon Skeet
This is, IMHO, a really bad idea. Either all the code that touches this is 8-bit clean, or it's not. If it's not, then putting that U in there won't help. If it is, then you're still byte-picking in the middle of strings. What if some other program splits it in the middle?
bmargulies
The "U" would be an indicator to expect UTF-8 within the string. I agree that if anything's going to trample on the top bit, there'll be problems - just as I said in my final paragraph. Why shouldn't he be able to treat a U-prefixed string differently to a non-U-prefixed string? I'm not suggesting that he switches *within* the string, but that he treats either the whole of the string's data as UTF-8, or the whole of it as ASCII. Finding the end of the string shouldn't be a problem.
Jon Skeet
Yes, I have full control at both ends, although there is another developer at the other end whose software is written in FoxPro. I'll be writing a new client as an ActiveX control which he will "eventually" use, but which we'll also be redistributing to other vendors. I guess I'm just trying to save time. I also like the ability to send messages to the server with telnet - going full Unicode will break that.
Matt H
The question here is the definition of 'within'. The overall message looks like a string to me. If the OP can really guarantee complete control of the interpretation, OK.
bmargulies
@Matthew: If your test messages using telnet only require ASCII, then you can go UTF-8 for everything without any issues - there'd be no difference at all between an all-ASCII message and the same message in UTF-8, because UTF-8 was designed to be ASCII compatible.
Jon Skeet
@JON thanks... If that is true then I'll simply do that, as that will maintain complete backward compatibility, right?
Matt H
@Matthew Hook: Yes, any existing valid messages will be valid and mean the same thing when treated as UTF-8. (The reverse isn't true of course - UTF-8 messages containing non-ASCII characters would be rejected or misinterpreted by an "old" other end.) Going UTF-8 everywhere will certainly be a simpler solution :)
Jon Skeet
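The one-way nature of that compatibility can be sketched in Python. The message and name are invented for illustration, and the "old" endpoint is modelled here as either a strict ASCII decoder or a Latin-1 decoder, which is an assumption about how legacy clients might behave.

```python
# A new-style message carrying a non-ASCII name, encoded as UTF-8.
new_message = 'NAME "Foto Svart Blå"\r\n'.encode("utf-8")

# An old endpoint that insists on ASCII rejects the message outright.
try:
    new_message.decode("ascii")
    old_client_accepts = True
except UnicodeDecodeError:
    old_client_accepts = False

# An old endpoint that assumes Latin-1 accepts it but shows mojibake:
# the two UTF-8 bytes of "å" are displayed as two separate characters.
misread = new_message.decode("latin-1")
```

Here `old_client_accepts` ends up false and `misread` contains `BlÃ¥` in place of `Blå`, so non-ASCII names should only be sent to clients that have been updated.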
+1  A: 

Why don't you want the field to be "always interpreted as UTF-8"? You don't say.

If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.

Jonathan Feinberg