ansaurus

Question

Is it safe to use random Unicode for complex delimiter sequences in strings?

Answer 1

+4 A:

Since you must encode the data in a string, I assume you are interfacing with other systems. Why not use something like XML or JSON for this rather than inventing your own data format?

With XML you can specify the encoding in use, e.g.:

<?xml version="1.0" encoding="UTF-8"?>

RedFilter 2010-04-19 16:17:31

I'll be honest: I do not know what JSON is. XML, I would hazard, would result in significantly larger strings. If I give in and use more than 1 char, I can settle for the 2-char system, and as an added bonus, I won't have to add any new kind of parser (since you can't do regular expressions with XML).

ccomet 2010-04-19 16:43:00

JSON is significantly smaller than XML, though larger than a single-character delimiter. In terms of bang for your buck, it's a pretty good compromise: http://www.json.org/example.html

Nick Gotch 2010-04-19 16:46:53

XQuery and XPath both support regular expressions. You mentioned tiers so it sounds like your data is hierarchical; this will be handled much better by JSON or XML than by using a delimiter technique.
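As a rough illustration of the JSON route, here is a minimal Python sketch using the standard `json` module. The two-tier `data` structure and its field names are hypothetical, since the asker's actual data layout is not shown in this thread.

```python
import json

# Hypothetical two-tier data: records, each holding a list of values.
# The values deliberately contain characters that a hand-rolled
# delimiter format would have to worry about.
data = [
    {"name": "first", "values": ["a;b", "c|d"]},
    {"name": "second", "values": []},
]

# Compact separators keep the string short; ensure_ascii=False leaves
# non-ASCII text readable instead of writing \uXXXX escapes.
encoded = json.dumps(data, ensure_ascii=False, separators=(",", ":"))

# Decoding restores the nesting without any custom parser.
decoded = json.loads(encoded)
```

The size overhead the asker worries about is real, but the nesting and the escaping of special characters are handled by the library rather than by custom code.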

RedFilter 2010-04-19 17:01:36

@OrbMan For hierarchical data, indeed, JSON or XML is much better at handling it. Maybe my perspective is wrong... but for this particular system, I feel that spending 4+ times as many characters in tags just to store the information using JSON or XML doesn't feel like a good tradeoff. The class structure used by the rest of the application handles organizing the decoded data fairly well, too, so it just seems unnecessary. Again, maybe I'm looking at it from a bad perspective.

ccomet 2010-04-19 17:32:19

Answer 2

+2 A:

You could take the same approach as URL or HTML encoding, and replace key chars with sequences of chars. For example, & becomes &amp;.

Although this results in more chars, it could be pretty efficiently compressed due to the repetition of those sequences.
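A minimal Python sketch of this entity-style approach, assuming `|` is the field delimiter. The entity name `&pipe;` is made up for illustration; only `&amp;` comes from HTML.

```python
DELIM = "|"  # assumed field delimiter

def escape(value: str) -> str:
    # Escape "&" first, so the ampersands introduced by "&pipe;"
    # are not escaped a second time.
    return value.replace("&", "&amp;").replace(DELIM, "&pipe;")

def unescape(value: str) -> str:
    # Reverse order: restore the delimiter first, then the ampersand.
    return value.replace("&pipe;", DELIM).replace("&amp;", "&")

def encode(fields):
    # After escaping, no field contains DELIM, so joining is unambiguous.
    return DELIM.join(escape(f) for f in fields)

def decode(record):
    return [unescape(f) for f in record.split(DELIM)]
```

With these definitions, `encode(["a|b", "c"])` yields `a&pipe;b|c`, and `decode` returns the original two fields.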

Neil Barnwell 2010-04-19 16:17:50

Isn't that similar to just using ;# and ;|?

ccomet 2010-04-19 16:44:31

Answer 3

+1 A:

Well, Unicode is a standard, so as long as everybody involved (code, database, etc.) is using Unicode, you shouldn't have any problems.

statichippo 2010-04-19 16:18:13

Answer 4

A:

Remember some of the laws of Murphy:

"Anything that can go wrong will."

"Anything that can't go wrong, will anyway."

Characters that "definitely will not be used" may eventually be used. When they are, the application will definitely fail.

You can use any character you like as a delimiter, provided you escape the values so that the character is guaranteed not to appear in them. I wrote an example a while back, showing that you could even use a common character like "a" as a delimiter.

Escaping the values of course means that some characters will be represented as two characters, but usually that will still be less of an overhead than using a multiple character delimiter. And more importantly, it's completely safe.
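The example mentioned above is not reproduced in this thread, so the following Python sketch is only one possible way to do it, not the author's code. It assumes "a" as the delimiter and "b" as a hypothetical escape character. The delimiter is mapped to a different two-character sequence so that an escaped value never contains "a".

```python
DELIM = "a"  # deliberately a common character
ESC = "b"    # hypothetical escape character

# After ESC, "b" stands for a literal ESC and "c" for a literal DELIM.
_ENCODE = {ESC: ESC + "b", DELIM: ESC + "c"}
_DECODE = {"b": ESC, "c": DELIM}

def escape(value: str) -> str:
    # The result never contains DELIM.
    return "".join(_ENCODE.get(ch, ch) for ch in value)

def unescape(value: str) -> str:
    # Assumes well-formed input; a dangling or unknown escape raises.
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == ESC:
            out.append(_DECODE[next(chars)])
        else:
            out.append(ch)
    return "".join(out)

def join_fields(fields):
    return DELIM.join(escape(f) for f in fields)

def split_fields(record):
    return [unescape(part) for part in record.split(DELIM)]
```

Only the two special characters grow to two characters each; every other character is stored unchanged, which is where the overhead claim above comes from.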

Guffa 2010-04-19 16:47:23

Answer 5

+3 A:

There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.

The main characters that can be altered in a text transfer process are the end of line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.

After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.

That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.

Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.

In the Unicode Code Charts, canonical normalizations are indicated by "≡" and compatibility normalizations are indicated by "≈".
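One way to check a candidate delimiter against all four normalization forms is Python's standard `unicodedata` module. This is a minimal sketch; the helper name `survives_normalization` and the sample characters are chosen for illustration and are not the characters from the question.

```python
import unicodedata

def survives_normalization(candidate: str) -> dict:
    """Report, per normalization form, whether the string is unchanged."""
    return {
        form: unicodedata.normalize(form, candidate) == candidate
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

# U+2021 DOUBLE DAGGER has no decomposition: unchanged in every form.
dagger = survives_normalization("\u2021")

# U+2460 CIRCLED DIGIT ONE has a compatibility decomposition to "1":
# unchanged by NFC/NFD, altered by NFKC/NFKD.
circled = survives_normalization("\u2460")

# U+00E9 (e with acute) has a canonical decomposition:
# altered by the decomposing forms NFD and NFKD.
accented = survives_normalization("\u00e9")
```

A character that reports `True` for all four forms is a safer delimiter choice with respect to normalization than one that does not.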

Jeffrey L Whitledge 2010-04-19 17:06:26
