views:

188

answers:

7

I need a rarely used character to serve as a delimiter.

Does anyone know of any statistics on this?

A: 

I'm sure there are tons of strange Unicode characters that don't get used much, but that's probably not what you're looking for.

Why do you want something "rare" for a delimiter? How will it be used?

chris
+1  A: 

What about using a string of characters as delimiter?

Thomas Stock
Same problem, just a little more rare :)
Joey
+10  A: 

Pick any character, then pick a mechanism to escape that character to handle the case where the user wants to type it. For example, in comma-delimited files the comma is the separator:

1,2,fred,john

Unless the data itself contains a comma, then you quote it:

1,2,"Bloggs, Fred",john

And if you need to use a quote:

1,2,"Bloggs, Fred","Jean-Luc \"Make it so\" Picard"
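As a rough illustration (not part of the original answer), Python's standard `csv` module can be configured to read this exact convention: comma as separator, double quotes around fields, and a backslash to escape a quote. The helper names `parse_line` and `format_line` are made up for this sketch.

```python
import csv
import io

# Settings assumed to match the example above: comma delimiter, double-quote
# quoting, and backslash (rather than quote doubling) to escape a quote.
DIALECT = dict(delimiter=",", quotechar='"', escapechar="\\", doublequote=False)

def parse_line(line):
    """Split one delimited line into its fields."""
    return next(csv.reader(io.StringIO(line, newline=""), **DIALECT))

def format_line(fields):
    """Join fields into one line, quoting or escaping where needed."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **DIALECT).writerow(fields)
    return buf.getvalue().rstrip("\r\n")  # drop the writer's line terminator

fields = parse_line(r'1,2,"Bloggs, Fred","Jean-Luc \"Make it so\" Picard"')
for field in fields:
    print(field)

# Writing the fields out and reading them back should give the same list.
assert parse_line(format_line(fields)) == fields
```

The exact text `format_line` produces may differ slightly from the hand-written line (the module decides for itself which fields need quotes), but it should parse back to the same fields.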
Greg Hewgill
I know there's going to be a good reason, but why not escape the comma? It feels like the " " is just another part of the delimiter
Thomas Stock
You could certainly escape the comma and not worry about quoting. However, I wanted to present a real-world example that contained a couple of different techniques.
Greg Hewgill
I think your direction is all right, but your solution is problematic.
Shore
+3  A: 

I don't think it matters what character you use. You shouldn't just hope that no one will type your delimiter. Use a comma and handle the users adding their own commas.

Joshua Belden
Yes, right, but how do you handle commas when a comma is the delimiter?
Shore
@Shore comma comma? :D
Patrick Gryciuk
Greg said it all.
Joshua Belden
+2  A: 

You could prefix each piece of data with its length. That's how HTTP chunked encoding sends things across the web.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
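A minimal sketch of the length-prefix idea, assuming a simplified `<length>:<text>` record format (the same shape bencode uses for strings, not the actual HTTP chunked wire format, which uses hexadecimal sizes and CRLF line endings). The function names are hypothetical, and lengths here count Python characters rather than bytes.

```python
def encode_prefixed(items):
    """Join strings as '<length>:<text>' records, one after another."""
    return "".join(f"{len(s)}:{s}" for s in items)

def decode_prefixed(data):
    """Recover the list of strings produced by encode_prefixed()."""
    items = []
    pos = 0
    while pos < len(data):
        colon = data.index(":", pos)      # the decimal length ends at the colon
        length = int(data[pos:colon])
        start = colon + 1
        end = start + length
        if end > len(data):
            raise ValueError("truncated record")
        items.append(data[start:end])     # the payload is copied unchanged
        pos = end
    return items

original = ["fred", "Bloggs, Fred", "1:2", ""]
packed = encode_prefixed(original)
print(packed)
assert decode_prefixed(packed) == original
```

Because the decoder never looks inside a payload, the data may contain commas, colons, or digits without any escaping.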

Patrick Gryciuk
Other things using the "prefix a string with its length" technique: bencoding (http://en.wikipedia.org/wiki/Bencode), and Google's Protocol buffers (http://code.google.com/apis/protocolbuffers/docs/encoding.html#types)
Daniel Martin
+1  A: 

In such cases, I like to use the vertical bar | character.

  • It's easy to spot when looking at a text file.
  • It clearly marks a separation.
  • It's rarely used.
  • And, since it has no intrinsic meaning in English grammar, it is easy to either just disallow it or blindly change it to something else (like a dash) if it appears in the column text.
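A small sketch of the "blindly change it" option, assuming a dash as the substitute; the helper names are invented for illustration. Note that this is lossy: a bar in the input cannot be recovered after splitting.

```python
DELIM = "|"
SUBSTITUTE = "-"  # assumed replacement for any bar found inside a field

def join_fields(fields):
    """Join fields with '|', replacing any '|' inside a field with '-'."""
    return DELIM.join(f.replace(DELIM, SUBSTITUTE) for f in fields)

def split_fields(line):
    """Split a line produced by join_fields() back into fields."""
    return line.split(DELIM)

line = join_fields(["Bloggs, Fred", "either|or", "john"])
print(line)                # Bloggs, Fred|either-or|john
print(split_fields(line))  # ['Bloggs, Fred', 'either-or', 'john']
```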
James Curran
+2  A: 

You sound like you're trying to convert a list of strings into a single string in such a manner that you can later turn it back into a list of strings.

There are several traditional approaches to this, most of them already mentioned in this thread:

  • Use an unusual character as a delimiter, and simply don't allow it in your input, either by rejecting input containing the delimiter, or by replacing the delimiter with "?" or "." or similar. For this, I agree with the person who suggested the vertical bar (|)
    • Advantage: dirt simple to code, in a wide variety of languages
    • Disadvantage: You lose some expressiveness and chances for future expansion by eliminating the possibility of input containing your delimiter.
  • Use a delimiter, and an escape mechanism when the delimiter appears in input. There are actually a few variants to this:
    • The "just like C code" method, where you prepend an escape character to every occurrence in your data of your delimiter or your escape character. For example: the string «Greetings,Hey,Hello\,World,Hello \\ Backslash» contains four elements, using , as the delimiter and \ as the escape character. (The last element has one backslash originally)
      • This is actually a royal pain to code and implement correctly in many languages
      • Even once you do implement it, it's generally much slower compared to other methods
    • The "like URL parameters" method where your escape mechanism is to convert your delimiter into a multi-character sequence that does not contain your delimiter. You then also need to convert the first character of whatever your delimiter turns into to its own multi-character sequence. For example, if you decided to use , as your delimiter, and decide to represent , as «\1» and \ as «\2», you could write the last example as: «Greetings,Hey,Hello\1World,Hello \2 Backslash»
      • This is usually not too hard to implement. The advantage is that you can do the "splitting" and "unescaping" parts of going from string to list-of-strings in separate steps. The unescaping process might be a tiny bit tricky, since you have to do it as a scan of each string.
    • Like CSV files, with quotes around items that contain your delimiter, and the quotes escaped according to some obscure mechanism. (Such as by doubling)
      • Avoid this unless you can just throw it at a pre-existing library.
      • This has all the disadvantages of the "Like C code" method, plus extra confusing state to screw up when implementing it.
    • One of the above methods, but with a multi-character delimiter. This is harder than you'd think; the extra characters actually significantly complicate the logic of what exactly should be escaped.
  • Prefix each item with its length, then include the item unchanged
    • This is used by HTTP in its "Chunked" encoding, by bencoding (the wire format bittorrent uses), and by Google's protocol buffers.
    • Implementing this can be a tiny bit tricky, and is very prone to off-by-one errors. I still think it's easier to implement than the "like C code" method, especially in low-level languages.
    • Once you do implement it correctly, it's generally much faster than the other schemes, even the lossy scheme that just forbids input containing the delimiter. (The exception is if you're working in a high-level language that has a built-in "split" routine)
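To make the "just like C code" variant concrete, here is one possible sketch using , as the delimiter and \ as the escape character, as in the example above. The function names are hypothetical. The decoder scans left to right, and a backslash always consumes the character that follows it, which is what keeps an escaped backslash distinct from an escaping one.

```python
def escape_join(items, delim=",", esc="\\"):
    """Join items, prefixing every delimiter or escape character with esc."""
    escaped = []
    for item in items:
        # Double the escape character first, then escape the delimiter.
        escaped.append(item.replace(esc, esc + esc).replace(delim, esc + delim))
    return delim.join(escaped)

def split_escaped(text, delim=",", esc="\\"):
    """Split text on unescaped delimiters and remove the escapes."""
    items, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == esc:
            # Take the next character literally; a trailing lone escape
            # character is kept as-is (an assumption of this sketch).
            current.append(next(chars, esc))
        elif ch == delim:
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
    items.append("".join(current))
    return items

parts = split_escaped(r"Greetings,Hey,Hello\,World,Hello \\ Backslash")
print(parts)
assert escape_join(parts) == r"Greetings,Hey,Hello\,World,Hello \\ Backslash"
```

One caveat of this sketch: an empty list and a list holding one empty string both encode to the empty string, so that case would need its own convention.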
Daniel Martin
Personally I think the "just like C code" method is the clearest, but it doesn't seem easy to implement in PHP. What a pity!
Shore
Yeah, everyone always thinks of "like C code", but it really is a pain to implement in pretty much any language. The "like URL parameters" method is much, much easier. Since you're working in PHP, you might want to do a two-pass approach: explode() on the delimiter, then in each list element use str_replace to do the unescaping. Going the other way in PHP is easy too - just use str_replace on each list element to replace delimiter and escape characters, and then join with the delimiter.
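The two-pass approach can be sketched roughly as follows, written in Python rather than PHP, with `str.replace`, `str.split`, and `str.join` standing in for the PHP functions. It assumes the codes from the earlier example («\1» for , and «\2» for \); the function names are made up. The order of the replacements matters in both directions.

```python
DELIM, ESC = ",", "\\"
ESC_DELIM = ESC + "1"  # stands for a literal delimiter
ESC_ESC = ESC + "2"    # stands for a literal escape character

def join_items(items):
    """Escape each item, then join with the delimiter."""
    # Replace the escape character first, so the backslashes introduced
    # for delimiters are not escaped a second time.
    return DELIM.join(
        s.replace(ESC, ESC_ESC).replace(DELIM, ESC_DELIM) for s in items
    )

def split_items(text):
    """Pass 1: plain split. Pass 2: unescape each piece."""
    # The plain split is safe because escaped items contain no delimiter.
    # Unescaping runs in the reverse order of escaping: delimiter codes
    # first, escape-character codes last.
    return [
        part.replace(ESC_DELIM, DELIM).replace(ESC_ESC, ESC)
        for part in text.split(DELIM)
    ]

items = ["Greetings", "Hey", "Hello,World", "Hello \\ Backslash"]
joined = join_items(items)
print(joined)  # Greetings,Hey,Hello\1World,Hello \2 Backslash
assert split_items(joined) == items
```

Swapping the two replacements in `split_items` would break inputs such as a backslash followed by the digit 1, which encodes to «\21» and would then wrongly decode to a comma.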
Daniel Martin
For "like C code", the problem lies in identifying whether a '\' is really a '\' symbol or an escape symbol; there seems to be no way to implement it in PHP.
Shore
Because in PHP, '\,' is the same as '\\,', so if we add a '\' before a ',', we will never know whether it is an escape character or a real '\'.
Shore