ansaurus

Question

Least used delimiter character in normal text < ASCII 128

Answer 1

+5 A:

Can you use a pipe symbol? That's usually the next most common delimiter after comma- or tab-delimited strings. Most text is unlikely to contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.

Jay 2009-01-29 15:38:20

Answer 2

+11 A:

How about using a CSV-style format? Characters can be escaped in a standard CSV format, and there are already a lot of parsers written.

Alex Fort 2009-01-29 15:38:31

I like this better than my idea. +1.

IainMH 2009-01-29 15:40:07

I think a comma counts as a common character in normal text. If it were as simple as using CSV I doubt there'd be a need to ask the question...

Jay 2009-01-29 15:43:16

Commas can be escaped in a CSV format, however.

Alex Fort 2009-01-29 15:45:05

CSV deals with commas in normal text as well as a few other issues, so it doesn't matter that there is already a comma in the text. IIRC it puts the text in quotes and escapes any quotes inside it.
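
As a rough illustration of the quoting described here, this is a minimal sketch using Python's standard csv module. The sample rows are made up for the example; the point is only that commas, quotes, and newlines inside a field survive a round trip.

```python
import csv
import io

# Hypothetical sample rows containing a comma, quotes, and a newline.
rows = [
    ["id", "comment"],
    ["1", 'He said "hi", then left'],
    ["2", "line one\nline two"],
]

# Write the rows; the csv module quotes fields that need it and
# doubles any embedded quote characters.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
encoded = buffer.getvalue()

# Read them back; the parser undoes the quoting.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

After running this, decoded should equal rows, and encoded shows the quoted form that actually goes into the file.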

Jeremy French 2009-01-29 15:46:32

@Jeremy: exactly right. Here's a wikipedia article mentioning how the escaping scheme works: http://en.wikipedia.org/wiki/Comma-separated_values

rmeador 2009-01-29 16:28:59

To put it bluntly: CSV will deal with all those issues which you didn't think of and make sure that you won't have to fix your "solution" every two weeks because it breaks due to some unforeseen input.

Aaron Digulla 2009-01-29 16:40:34

I was assuming (perhaps wrongly) that the data is not escaped and for some reason there's inadequate control over the data source to ensure it will be properly escaped. Otherwise it's always preferable to use an existing library of course.

Jay 2009-01-29 17:59:13

Answer 3

+3 A:

Probably | or ^ or ~. You could also combine two characters.

SQLMenace 2009-01-29 15:38:49

Answer 4

A:

You're probably going to have to pick something and ignore its other uses.

might be a good candidate.

IainMH 2009-01-29 15:39:25

Answer 5

+1 A:

Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.

Jackson 2009-01-29 15:39:34

Answer 6

+2 A:

Pipe for the win! |

Eppz 2009-01-29 15:41:43

Answer 7

+3 A:

Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
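
A minimal sketch of that character count, in Python. The function names and the sample string are hypothetical; in practice the sample would be a representative chunk of your real data.

```python
from collections import Counter

def ascii_counts(sample):
    """Count how often each code point 0-127 occurs in the sample."""
    counts = Counter(sample)
    return {code: counts[chr(code)] for code in range(128)}

def unused_printable(sample):
    """Printable, non-space ASCII characters that never occur in the sample."""
    counts = ascii_counts(sample)
    return [chr(code) for code in range(0x21, 0x7F) if counts[code] == 0]

# Hypothetical sample; replace with text read from real data files.
sample = "Hello, world! Costs are 5^2 | 3 ~ roughly."
candidates = unused_printable(sample)
```

Any character in candidates is a possible delimiter for this sample. With a sample this small the list is long, which is the "get a bigger data set" case.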

The answer will be different for different problem domains: | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true of most other characters.

I personally think I'd go for | (pipe) if given a choice but going with real data is safest.

And whatever you do, make sure you've worked out an escaping scheme!
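
One possible escaping scheme, sketched in Python. The choice of backslash as the escape character and the names join_fields and split_fields are assumptions for illustration, not something this answer prescribes.

```python
DELIM = "|"
ESC = "\\"

def join_fields(fields):
    """Escape the escape character first, then the delimiter, then join."""
    escaped = [
        field.replace(ESC, ESC + ESC).replace(DELIM, ESC + DELIM)
        for field in fields
    ]
    return DELIM.join(escaped)

def split_fields(line):
    """Split on unescaped delimiters and undo the escaping."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == ESC:
            # Take the next character literally; a trailing lone escape is kept.
            current.append(next(chars, ESC))
        elif ch == DELIM:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

With this in place, split_fields(join_fields(fields)) should give back the original list even when a field contains a pipe or a backslash.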

Nick Fortescue 2009-01-29 15:48:32

Answer 8

A:

We use ASCII 0x7F (DEL), which is pseudo-printable and hardly ever comes up in regular usage.

Joe 2009-01-30 01:09:42

Answer 9

+1 A:

You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.

(Interestingly enough, the ASCII table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
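
For what it's worth, here is a small Python sketch of using those separator characters, assuming US between fields and RS between records, and assuming the data itself never contains either. The encode and decode names are hypothetical.

```python
US = "\x1f"  # unit separator: placed between fields
RS = "\x1e"  # record separator: placed between records

def encode(records):
    """Join fields with US and records with RS."""
    return RS.join(US.join(fields) for fields in records)

def decode(text):
    """Reverse of encode, assuming the data contains neither US nor RS."""
    return [record.split(US) for record in text.split(RS)]

# Hypothetical records full of the usual delimiter candidates.
records = [["Smith, John", "a|b^c~d"], ["O'Brien", "tab\there"]]
restored = decode(encode(records))
```

The trade-off is that these characters are not easy to type or to see in an editor, so this suits machine-to-machine files better than hand-edited ones.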

If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (@ or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.

Jason S 2009-01-30 01:29:35

Answer 10

A:

I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
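
A rough Python sketch of that loop, working on a list of fields rather than a file. The function name pick_delimiter and the sample fields are made up for the example.

```python
def pick_delimiter(fields, base="&,"):
    """Double the candidate until it no longer occurs inside any field."""
    delimiter = base
    while any(delimiter in field for field in fields):
        delimiter += delimiter
    return delimiter

# Hypothetical fields; the first one already contains "&,".
fields = ["salt &, pepper", "plain text", "R&D, sales"]
delimiter = pick_delimiter(fields)
joined = delimiter.join(fields)
```

Here the delimiter ends up as "&,&," and joined.split(delimiter) returns the original fields. Two caveats: the reader needs to be told which delimiter was chosen (for example in a header line), and a field that ends with part of the delimiter, such as "x&,", can still shift the split point, so a round-trip check before writing is worthwhile.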

2009-02-11 05:28:27

Answer 11

A:

This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.
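
A minimal sketch of the idea in Python, assuming the standard Base64 alphabet (A-Z, a-z, 0-9, +, / and = for padding) and a pipe as the separator. The pack and unpack names are hypothetical.

```python
import base64

SEP = "|"  # not part of the standard Base64 alphabet

def pack(fields):
    """Base64-encode each field as UTF-8, then join with the separator."""
    return SEP.join(
        base64.b64encode(field.encode("utf-8")).decode("ascii")
        for field in fields
    )

def unpack(line):
    """Split on the separator and decode each part."""
    return [base64.b64decode(part).decode("utf-8") for part in line.split(SEP)]

# Hypothetical fields, including one that contains the separator itself.
fields = ["a|b,c", "<x>]]></x>", ""]
restored = unpack(pack(fields))
```

The cost is that the packed data grows by roughly a third and is no longer human-readable.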

I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.

CSV is probably a better idea for most situations, though.

Coxy 2009-02-11 05:59:43

Answer 12

A:

When using different languages, this symbol: ¬

proved to be the best. However I'm still testing.

Icarin 2010-09-01 16:49:34
