views: 121
answers: 5

Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use characters like ¦, § or ‡ for complex delimiter sequences in strings? Can I reliably trust that I won't run into any issues with a program reading these incorrectly?


I am working in a system, using C# code, in which I have to store a fairly complex set of information within a single string. The readability of this string only matters on the computer side; end-users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify which parts of the string correspond to which tier of organization. There are enough cases that the standard set of ;, |, and their ilk has been exhausted. I considered two-char delimiters, like ;# or ;|, but that felt very inefficient. There probably isn't a large performance difference between storing one char versus two, but when I have the option of picking the smaller one, it just feels wrong to pick the larger.

So finally, I considered using characters like the double dagger (‡) and the section sign (§). They take up only one char each, and they are definitely not going to show up in the actual text that I'll be storing, so they won't be confused with anything.

But character encoding is finicky. While visibility to the end user is irrelevant (since they won't see the string at all), I recently became concerned about how the programs in the system will read it. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into the different object types the rest of the application works with. If something that is expected to be written one way is possibly written another, then the whole system could fail, and I can't let that happen. So is it safe to use these kinds of characters as behind-the-scenes delimiters?
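To make the structure concrete, here is a rough sketch of the kind of tiered parsing I have in mind (the delimiter characters and the record layout are placeholders, not the real format):

using System;

class TieredParseSketch
{
    static void Main()
    {
        // Hypothetical layout: '§' separates records, '‡' separates fields,
        // '¦' separates items inside a collection-valued field.
        string data = "Alice‡30‡red¦green§Bob‡25‡blue";

        foreach (string record in data.Split('§'))
        {
            string[] fields = record.Split('‡');
            string name = fields[0];
            int age = int.Parse(fields[1]);
            string[] colors = fields[2].Split('¦');
            Console.WriteLine($"{name}, {age}, colors: {string.Join(", ", colors)}");
        }
    }
}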

+4  A: 

Since you must encode the data in a string, I assume you are interfacing with other systems. Why not use something like XML or JSON for this, rather than inventing your own data format?

With XML you can specify the encoding in use, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
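With JSON, a rough sketch might look like this (using System.Text.Json and a hypothetical Person type standing in for whatever your delimited string currently encodes; Json.NET would work just as well):

using System.Text.Json;

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
    public string[] Colors { get; set; }
}

class Demo
{
    static void Main()
    {
        var person = new Person { Name = "Alice", Age = 30, Colors = new[] { "red", "green" } };

        // Nesting, escaping, and encoding are all handled by the library.
        string json = JsonSerializer.Serialize(person);
        // => {"Name":"Alice","Age":30,"Colors":["red","green"]}

        // The other program deserializes straight back into a typed object.
        Person roundTripped = JsonSerializer.Deserialize<Person>(json);
    }
}

Either format saves you from maintaining a hand-rolled parser as the structure grows.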
RedFilter
I'll be honest: I do not know what JSON is. XML, I would hazard, would result in significantly larger strings. If I'm going to give in and use more than one char, I might as well settle for the two-char system, and as an added bonus I won't have to add any new kind of parser (since you can't parse XML with regular expressions).
ccomet
JSON's significantly smaller in size than XML and larger than a single-character delimiter. Bang for your buck, it's a pretty good compromise: http://www.json.org/example.html
Nick Gotch
XQuery and XPath both support regular expressions. You mentioned tiers, so it sounds like your data is hierarchical; that will be handled much better by JSON or XML than by a delimiter technique.
RedFilter
@OrbMan Indeed, JSON or XML handles hierarchical data much better. Maybe my perspective is wrong, but for this particular system, spending 4+ times as many characters on tags just to store the information in JSON or XML doesn't feel like a good tradeoff. The class structure used by the rest of the application already organizes the decoded data fairly well, so it just seems unnecessary. Again, maybe I'm looking at it from a bad perspective.
ccomet
+2  A: 

You could take the same approach as URL or HTML encoding and replace key characters with sequences of characters, e.g. & becomes &amp;.

Although this results in more chars, it could be pretty efficiently compressed due to the repetition of those sequences.
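A rough sketch of that idea, leaning on the built-in URL-style escaping (Uri.EscapeDataString; the ';' delimiter and the sample values are just placeholders):

using System;

class EscapeSketch
{
    static void Main()
    {
        // Values may themselves contain the delimiter character.
        string[] values = { "plain", "has;a;semicolon", "has|a|pipe" };

        // Percent-encode each value, after which a plain ';' is safe as the delimiter.
        string encoded = string.Join(";", Array.ConvertAll(values, Uri.EscapeDataString));
        // => "plain;has%3Ba%3Bsemicolon;has%7Ca%7Cpipe"

        // Split on the delimiter and decode each piece to recover the originals.
        string[] decoded = Array.ConvertAll(encoded.Split(';'), Uri.UnescapeDataString);
    }
}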

Neil Barnwell
Isn't that similar to just using ;# and ;|?
ccomet
+1  A: 

Well, Unicode is a standard, so as long as everything involved (code, database, etc.) handles Unicode correctly, you shouldn't have any problems.
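For example, a round trip through UTF-8 bytes (the sort of thing a write to and read from a Unicode database column involves) leaves those characters intact; a quick check you could run:

using System;
using System.Text;

class EncodingSketch
{
    static void Main()
    {
        string withDelimiters = "field1§field2‡field3";

        // Encode to bytes and back, as storage or transfer would.
        byte[] bytes = Encoding.UTF8.GetBytes(withDelimiters);
        string roundTripped = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(withDelimiters == roundTripped);  // True - nothing was altered
    }
}

The problems only start if one side reads or writes the bytes with a mismatched encoding (e.g. a legacy code page instead of UTF-8).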

statichippo
A: 

Remember some of the laws of Murphy:

"Anything that can go wrong will."

"Anything that can't go wrong, will anyway."

Those characters that definitely will not be used may eventually be used. When they are, the application will definitely fail.

You can use any character you like as a delimiter, as long as you escape the values so that the delimiter character is guaranteed not to appear in them. I wrote an example a while back showing that you could even use a common character like "a" as the delimiter.

Escaping the values of course means that some characters will be represented as two characters, but usually that is still less overhead than using a multi-character delimiter. And more importantly, it's completely safe.
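A minimal sketch of the idea (the ';' delimiter and '\' escape character here are arbitrary choices, not the ones from that old example):

using System.Collections.Generic;
using System.Text;

static class DelimitedList
{
    const char Delimiter = ';';   // could be any character, even 'a'
    const char Escape = '\\';

    // Join values, escaping the delimiter and the escape character inside each value.
    public static string Join(IEnumerable<string> values)
    {
        var sb = new StringBuilder();
        bool first = true;
        foreach (string value in values)
        {
            if (!first) sb.Append(Delimiter);
            first = false;
            foreach (char c in value)
            {
                if (c == Delimiter || c == Escape) sb.Append(Escape);
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Split on unescaped delimiters and strip the escape characters.
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        var sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == Escape && i + 1 < text.Length)
            {
                sb.Append(text[++i]);          // keep the escaped character literally
            }
            else if (c == Delimiter)
            {
                result.Add(sb.ToString());     // an unescaped delimiter ends the value
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }
        result.Add(sb.ToString());
        return result;
    }
}

DelimitedList.Split(DelimitedList.Join(values)) round-trips the values no matter which characters they contain.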

Guffa
+3  A: 

There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.

The main characters that can be altered in a text-transfer process are the end-of-line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.

After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.

That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.

Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.

In the Unicode Code Charts, canonical normalizations are indicated by "≡" and compatibility normalizations are indicated by "≈".
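As a quick check, string.Normalize in C# applies these normalization forms, and the characters in question pass through unchanged:

using System;
using System.Text;

class NormalizationSketch
{
    static void Main()
    {
        string delimiters = "¦§‡";

        // Canonical composition (NFC) and decomposition (NFD) leave these symbols alone...
        Console.WriteLine(delimiters == delimiters.Normalize(NormalizationForm.FormC));  // True
        Console.WriteLine(delimiters == delimiters.Normalize(NormalizationForm.FormD));  // True

        // ...unlike a character with a diacritic, which NFD splits into "e" + U+0301.
        string accented = "é";
        Console.WriteLine(accented == accented.Normalize(NormalizationForm.FormD));      // False
    }
}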

Jeffrey L Whitledge