I have several textboxes where users can enter information. The input can include commas, so I can't use standard comma-delimited strings.

What is a good delimiter, one that users don't typically type, for marking where the strings should be split? I'm going to combine these fields into a single string and pass it to my encryption method. After I decrypt it, I need to be able to reliably separate the fields again.

I'm using C# if it matters.

+13  A: 

| would be next on my list and is often used as an alternative to CSV. Google "pipe delimited" and you will find many examples.

string[] items = new string[] {"Uno","Dos","Tres"};

string toEncrypt = String.Join("|", items);

// StringSplitOptions.None keeps empty fields, so an empty textbox doesn't shift the others
items = toEncrypt.Split(new char[] {'|'}, StringSplitOptions.None);

foreach(string s in items)
  Console.WriteLine(s);

And since everyone likes to be a critic about the encoding and not provide the code, here is one way to encode the text so your | delimiter won't collide. Base64 output never contains a |, so the split is unambiguous.

string[] items = new string[] {"Uno","Dos","Tres"};

for (int i = 0; i < items.Length; i++)
    items[i] = Convert.ToBase64String(Encoding.UTF8.GetBytes(items[i]));

string toEncrypt = String.Join("|", items);

// An empty string Base64-encodes to "", so keep empty entries instead of dropping them
items = toEncrypt.Split(new char[] {'|'}, StringSplitOptions.None);

foreach (string s in items)
     Console.WriteLine(Encoding.UTF8.GetString(Convert.FromBase64String(s)));
Chad Grant
This still relies on the kind of wishful thinking that causes security errors. Hoping that no user will ever enter a | just doesn't cut it. Regardless of separator, there has to be support for escaping. That, or the separator has to be blocked so the users can't enter them at all.
Emil H
@Emil noted. See my comment on mP's answer. Someone should provide the code for escaping. Maybe you're game?
Chad Grant
Nothing personal. Just wanted to add a note for people passing by. :)
Emil H
I wasn't taking it personal, just trying to give someone else a chance to post the code. ;)
Chad Grant
@DeviantMate - watching out for the delimiter, escaping it, and reversing that on the way back is pretty basic. The important thing was to point out the best way to solve the problem; the coding is trivial and a sample isn't necessary.
mP
Your code is worse than not giving an answer; it's convoluted and confusing for a newbie. I can't believe it got as many votes as it did, considering it's wrong. So many formats in the real world, e.g. XML and HTML, escape their special characters; they never base64 encode the entire thing just to escape a handful of characters here and there.
mP
The example is just that, an example. It may be using a sledgehammer when a hammer would have been more appropriate. However, at least the people in this thread attempted to help solve his problem, while you nitpicked and didn't bust out a line of code. Base64 is appropriate and used a lot in web dev, so hopefully he will learn more than you did from it. There were two other people providing foreach(char) in this thread, so I did a third alternative. Teaching options is important. You can't solve everything with a List<T>.
Chad Grant
A: 

The backtick. Nobody uses the backtick.

Promit
agreed, however windows copy/paste sometimes replaces single quotes with backticks, dunno why but it's frustrating
Chad Grant
Except that it`s commonly used by people who can`t find the apostrophe...
Guffa
Or people who think ``smart quotes'' are really nifty, but like low-tech, roll-your-own solutions.
Steve Jessop
A: 

The pipe character (|), perhaps? If your user base is remotely IT-shy, then this approach (asking them to delimit their text) might not be the best one to take; you could try something else, e.g. provide some means of dynamically adding a text box on the fly which accepts another string, etc.

If you provide a little more information about what you're doing, and for whom, it might be possible for someone to suggest an alternative approach.

Rob
+1  A: 

Newline? (i.e. use a multi-line text box)

Tim Robinson
+6  A: 

The best solution is to stick to commas and introduce support for character escaping. Whatever character you select will eventually need to be entered, so you may as well provide support for this.

Think backslashes + double quotes inside double-quoted strings.

Don't pick a character like backtick because some users might not know how to type it in...

mP
Good idea to point out escaping, please post some code of how he might implement that
Chad Grant
From what I understand from the question, the user will not have to enter the delimiting character, so back ticks would be okay to use. Escaping would still be a better solution though.
DeadHead
Agreed. Escaping is the way to go no matter how you do it. Given that, picking a delimiter is an optimization.
David Berger
@DeadHead: It doesn't matter where the data comes from, you still end up with a problem when the user enters the character you are using as your delimiter.
mP
@mP: "Don't pick a character like backtick because some users might not know how to type it in..." Read the post next time before you criticize please.
DeadHead
@DeadHead: Nowhere does it say the entered data cannot have a backtick. Thinking otherwise is silly and encouraging otherwise when a proper solution is available is .....
mP
+1  A: 

I would suggest using ";"

Izabela
; is reasonably common in text, I'd say. Of course, without more information about the nature of the strings being delimited, it's hard to say.
Rob
A: 

I prefer to use a combination of characters that a normal person would be unlikely to enter as my delimiter when possible. For example, I've used ")^&^(" and set it up as a const "cDelimiter" in my code, then concatenated all of my fields with that. By using a small unique string, I greatly reduce the likelihood of the user accidentally entering my delimiter. A user entering a | or a ~ is admittedly unlikely, but that doesn't mean it won't happen.
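
A minimal sketch of that approach, using the delimiter and constant name quoted above. The class name and the IsSafe helper are hypothetical additions for illustration. Note that the multi-character delimiter only lowers the odds of a collision; it does not remove them.

```csharp
using System;

public static class MultiCharDelimiter
{
    // Delimiter value and constant name taken from the answer above.
    public const string cDelimiter = ")^&^(";

    public static string Join(string[] fields)
    {
        return string.Join(cDelimiter, fields);
    }

    public static string[] Split(string combined)
    {
        // The string[] overload matches the whole sequence, not each character in it.
        return combined.Split(new[] { cDelimiter }, StringSplitOptions.None);
    }

    // True when no field contains the delimiter sequence.
    public static bool IsSafe(string[] fields)
    {
        return Array.TrueForAll(fields, f => !f.Contains(cDelimiter));
    }
}
```

If IsSafe returns false, Split will hand back more pieces than were joined, so the caller would have to reject or escape that input first.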

Frank Rosario
Colin Burnett
My answer assumed that the delimiting was being done by the code itself, not left to the user to delimit the fields. The point of my answer was the user would never be expected to enter a string such as provided in my example.
Frank Rosario
+3  A: 

Any of the less common characters, such as pipe |, backtick `, tilde ~, bang !, or semi-colon ;, would probably work. However, if you go this route you are really venturing away from usability. Asking them to escape commas with a backslash or something is begging for them to miss one.

If CSV is not possible then you should consider changing your UI. (Heck, you should stay away from CSV anyway for a user input!) You say textbox so I assume you're in web or some kind of win forms or WPF (definitely not a console). All of those give you better UI control than a single textbox and forcing users to conform to your difficult UI design.

More information would definitely help better guide answers.

However, here is an example of escaping a comma with a backslash. Note that you cannot escape a backslash that comes before a comma with this, so @"uno, dos, tr\\,es" will end up as {"uno", " dos", @"tr\,es"}.

string data = @"uno, dos, tr\,es";
string[] items = data.Split(','); // {"uno", " dos", @"tr\", "es"}
List<string> realitems = new List<string>();
// Walk backwards so a piece ending in a backslash can be merged with the piece after it
for (int i=items.Length-1; i >= 0; i--)
{
    string item = items[i];
    if (item.Length == 0) { realitems.Insert(0, ""); continue; }

    if (realitems.Count == 0) { realitems.Insert(0, item); }
    else
    {
        // Trailing backslash: drop it and rejoin with the next piece using a literal comma
        if (item[item.Length - 1] == '\\') { realitems[0] = item.Substring(0, item.Length - 1) + "," + realitems[0]; }
        else { realitems.Insert(0, item); }
    }
}

// Should end up with {"uno", " dos", "tr,es"}
Colin Burnett
string data = @"uno,,,,,, dos, tr\,es"; = kaboom
Chad Grant
Added Length == 0 check which adds "" to the list. Fixed?
Colin Burnett
+1  A: 

I figure eventually, every character is going to be used by someone. Users always find a way to break our HL7 parser.

Instead of a single character, maybe try a string that would be random enough that nobody'd ever use it. Something like "#!@!#".

Chris Doggett
I did exactly the same thing when parsing HL7.
Even Mien
+2  A: 

Will the user be entering delimited strings into the textboxes, or will they be entering individual strings which will then be built into delimited strings by your code?

In the first case it might be better to rethink your UI instead. E.g., the user could enter one string at a time into a textbox and click an "Add to list" button after each one.

In the second case it doesn't really matter what delimiter you use. Choose any character you like, just ensure that you escape any other occurrences of that character.

EDIT

Since several comments on other answers are asking for code, here's a method to create a comma-delimited string, using backslash as the escape character:

public static string CreateDelimitedString(IEnumerable<string> items)
{
    StringBuilder sb = new StringBuilder();

    foreach (string item in items)
    {
        sb.Append(item.Replace("\\", "\\\\").Replace(",", "\\,"));
        sb.Append(",");
    }

    return (sb.Length > 0) ? sb.ToString(0, sb.Length - 1) : string.Empty;
}

And here's the method to convert that comma-delimited string back to a collection of individual strings:

public static IEnumerable<string> GetItemsFromDelimitedString(string s)
{
    bool escaped = false;
    StringBuilder sb = new StringBuilder();

    foreach (char c in s)
    {
        if ((c == '\\') && !escaped)
        {
            escaped = true;
        }
        else if ((c == ',') && !escaped)
        {
            yield return sb.ToString();
            sb.Remove(0, sb.Length);
        }
        else
        {
            sb.Append(c);
            escaped = false;
        }
    }

    yield return sb.ToString();
}

And here's some example usage:

string[] test =
{
    "no commas or backslashes",
    "just one, comma",
    @"a comma, and a\ backslash",
    @"lots, of\ commas,\ and\, backslashes",
    @"even\\ more,, commas\\ and,, backslashes"
};

string delimited = CreateDelimitedString(test);
Console.WriteLine(delimited);

foreach (string item in GetItemsFromDelimitedString(delimited))
{
    Console.WriteLine(item);
}
LukeH
A: 

Detect a character that is not used, and then use that. Your final combined string can start with the character that is to be from that point used as the delimiter.

Example: your users enter "pants", ",;,;,;,;,;" and "|~~|". You iterate through a set of characters until you find one that is not used. It could be, say, "$". Your final, concatenated string is then "$pants$,;,;,;,;,;$|~~|". The initial character tells your program what character is to be used as the delimiter. This way, there are no forbidden characters, period.
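
A rough sketch of how that could look in C#. The class name and the candidate list are made up for illustration; any set of characters would do, as long as it is tried in a fixed order.

```csharp
using System;
using System.Linq;

public static class SelfDescribingDelimiter
{
    // Candidate delimiters, tried in order; an arbitrary illustrative list.
    private const string Candidates = "|~$#^`\t\u001F";

    public static string Join(string[] fields)
    {
        foreach (char candidate in Candidates)
        {
            // Use the first candidate that appears in none of the fields.
            if (fields.All(f => f.IndexOf(candidate) < 0))
                return candidate + string.Join(candidate.ToString(), fields);
        }
        throw new InvalidOperationException("Every candidate delimiter appears in the input.");
    }

    public static string[] Split(string combined)
    {
        if (string.IsNullOrEmpty(combined))
            throw new ArgumentException("Missing delimiter prefix.", "combined");

        // The first character says which delimiter was chosen for this string.
        char delimiter = combined[0];
        return combined.Substring(1).Split(delimiter);
    }
}
```

With the three inputs from the example, "|" and "~" are both taken, so Join picks "$" and produces the string shown above.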

pyrochild
And if the user manages to use every character in the encoding, they win a no-prize :-)
Steve Jessop
If that's a serious enough possibility, then it can fairly easily be extended to search for an unused 2 or even 3 character pattern, exponentially increasing the number of combinations the user would have to enter.
pyrochild
+1  A: 

I assume from what you say that the user is entering data into separate fields, and then you are combining it. So the user never needs to know or care what the delimiter is.

Don't just try to pick a character that "nobody ever uses", because either by accident or in order to try to break your code, some user will eventually use it.

So, I would either:

  • Insert backslashes to escape commas and backslashes in the user input, then combine the strings with commas. To separate, you split on unescaped commas (which is a job for a state machine), then unescape each component.

  • Use an off-the-shelf means of serializing a list of strings. What's available depends on your environment, I don't know C#/.NET well enough to advise. In Java you could just serialize a vector or whatever.

  • Separate the data with a control character like ASCII-BEL or ASCII-VT (or ASCII-NUL if your strings are never treated as nul-terminated), and reject user input which contains that character.

The first option is good if the user has to be allowed to enter any char values they like. The second option is good if you don't care about bloating the data significantly. The third option is good if you don't mind rejecting smart-alec users (or those with unusual requirements) who try to insert funny data.
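
For the third option, a minimal C# sketch might look like the following. It assumes ASCII BEL as the separator (one of the characters mentioned above); the class and method names are hypothetical.

```csharp
using System;

public static class ControlCharDelimiter
{
    // ASCII BEL (U+0007), one of the control characters suggested above.
    private const char Separator = '\a';

    public static string Join(string[] fields)
    {
        foreach (string field in fields)
        {
            // Reject input that contains the reserved character instead of guessing.
            if (field.IndexOf(Separator) >= 0)
                throw new ArgumentException("Input contains the reserved control character.");
        }
        return string.Join(Separator.ToString(), fields);
    }

    public static string[] Split(string combined)
    {
        return combined.Split(Separator);
    }
}
```

In a real form you would probably run the same check when the user submits, so they get a validation message rather than an exception.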

Steve Jessop
+1  A: 

I have seen unusual characters used as delimiters, even unusual character combinations like -|::|-, but even though they are less likely to occur, they still can.

You have basically two options if you want to make it water tight:

1: Use a character that is impossible to type, like the '\0' character:

Join:

string combined = string.Join("\0", inputArray);

Split:

string[] result = combined.Split('\0');

2: Escape the strings and use a character that the escaping removes from the values as the delimiter, for example by URL-encoding the values and using & as the delimiter:

Join:

string combined = string.Join("&", inputArray.Select<string,string>(System.Web.HttpUtility.UrlEncode).ToArray());

Split:

string[] result = combined.Split('&').Select<string,string>(System.Web.HttpUtility.UrlDecode).ToArray();
Guffa
+1  A: 

As has been noted, any character that you choose has the chance of appearing in the input, so you have to handle escaping. XML may be a good serialization format to use, since I believe that .NET has good support for writing and reading XML. This is likely to be much more robust than trying to implement your own character escaping, and will also be more extensible in the future.
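
As a sketch of what that might look like with LINQ to XML (System.Xml.Linq): the class name and the element names "fields" and "field" are arbitrary choices for this example, not anything the question specifies.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlFieldList
{
    public static string Serialize(IEnumerable<string> fields)
    {
        // XElement escapes <, > and & in the values when it writes them out.
        var root = new XElement("fields", fields.Select(f => new XElement("field", f)));
        // DisableFormatting avoids adding indentation whitespace to the output.
        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static List<string> Deserialize(string xml)
    {
        // PreserveWhitespace keeps fields that consist only of spaces.
        XElement root = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
        return root.Elements("field").Select(e => e.Value).ToList();
    }
}
```

One caveat: XML 1.0 cannot represent some control characters (NUL, for instance), so input containing those would need separate handling.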

Mike Ottum
+1  A: 

Nobody said TAB? Tab delimited is great but it isn't easy to type tabs into GUIs (it tends to move you to the next screen element). But for files generated by computer TAB is perfect since it really should never appear in user generated text.

jmucchiello
A: 

Use a tab (or maybe \n) - which if entered by the user would cause the text box to be exited.

le dorfier
+1  A: 

Why don't you just wrap each input in quotes?

That way you end up with this:

"Aaron","Johnson","25","I like cats, and dogs"

Don't forget to escape quotes on input...
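
A sketch of the quoting idea, assuming quotes inside a field are escaped by doubling them (the usual CSV convention; the answer doesn't say which escape to use). The class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class QuotedFields
{
    public static string Join(IEnumerable<string> fields)
    {
        // Wrap every field in quotes and double any quote inside it.
        return string.Join(",", fields.Select(f => "\"" + f.Replace("\"", "\"\"") + "\"").ToArray());
    }

    public static List<string> Split(string line)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote is a literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;    // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                result.Add(current.ToString()); // comma outside quotes ends the field
                current.Length = 0;
            }
            else
            {
                throw new FormatException("Unexpected character outside quotes at index " + i + ".");
            }
        }

        if (inQuotes)
            throw new FormatException("Unterminated quoted field.");
        if (line.Length > 0)
            result.Add(current.ToString());
        return result;
    }
}
```

Commas inside the quotes are treated as ordinary text, which is why the "I like cats, and dogs" field above survives intact.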

ChristianLinnell
+4  A: 

I don't think I've willingly self-delimited a collection of strings since I stopped using C. There's just no need for it in a "modern" language, and - while trivial - the number of edge cases is enough to annoy you to death.

Store them in a List<string> or string[] and serialize/deserialize them. Use XML if you want human readability or interop - or binary serialize them if you don't. You can encrypt the output easily either way, and there's no ambiguity and no need to write your own escaping routines.

In C#, it's less LOC and takes less time to write than this answer did. There's no excuse for rolling your own solution.
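
For the binary route, one possible sketch uses BinaryWriter and BinaryReader, which write each string with a length prefix. This is a swapped-in technique for illustration, not necessarily the serializer meant above, and the class name is made up.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LengthPrefixedFields
{
    public static byte[] Serialize(IList<string> fields)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.UTF8))
        {
            writer.Write(fields.Count);   // number of strings that follow
            foreach (string field in fields)
                writer.Write(field);      // each string is written with its byte length first
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static List<string> Deserialize(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new BinaryReader(stream, Encoding.UTF8))
        {
            int count = reader.ReadInt32();
            var fields = new List<string>();
            for (int i = 0; i < count; i++)
                fields.Add(reader.ReadString());
            return fields;
        }
    }
}
```

Because every string carries its own length, no character is special and nothing needs escaping. If the encryption method takes a string rather than bytes, Convert.ToBase64String on the result would bridge the gap.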

Mark Brackett
A: 

I also support the selection of TAB (\t) and to some extent the PIPE (|) symbol.

But the most used one in my experience is the semicolon (;) together with quoted fields and the escapes for \ and \" which is just perfect. Just needs a parser keeping the state. The actual delimiting char becomes unimportant.

If you use no escape, it is wise to count the "fields" per line and compare the count to your expected result. As most applications of this kind of file use some fixed number of fields, you can catch errors in the entry, and you get that "everything is good" feeling if the check does not trigger.
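
The field-count check could be as small as this sketch (the class and method names are hypothetical, and it assumes a plain semicolon split with no escaping):

```csharp
using System;

public static class FieldCountCheck
{
    // Splits on ';' and reports whether the number of fields matches what the caller expects.
    public static bool TrySplit(string line, int expectedFields, out string[] fields)
    {
        fields = line.Split(';');
        return fields.Length == expectedFields;
    }
}
```

A stray semicolon in the input then shows up as a count mismatch instead of silently shifting every later field.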

OderWat
+1  A: 

Mark Brackett has the correct answer. I'll only add that the very number of answers to this simple question should put you off of using delimited strings, ever. Let this be a "word to the wise".

John Saunders