When we want to split a string for whatever reason, we (or at least I) tend to split on the pipe character (|), as it is very rare for a user or the application itself to put one in a string ... but what happens if one does?

Well, the application simply crashes :)

I found out that a colleague uses non-printable characters for the same technique, for example:

String.Format(
         "{1}{0}{2}{0}{3}{0}{4}",
         (char)2,
         myFirstString,
         mySecondString,
         myThirdString,
         myFourthString);

and when we want to split the whole string back into its parts:

String.Split((char)2);

Is this safe? Should I adopt this way of splitting strings? Is there any other safe technique?

+2  A: 

It'd be better to never concatenate the strings together in the first place if you can help it. Splitting like this is a code smell.

Sure, using a control character is "more likely" to not have issues, but it's still not perfect. If you really have to do this, use NUL (\0). That character at least has a history of being a string sentinel.

John Kugelman
`"It'd be better to never concatenate the strings together in the first place"` How would you set a `string` `Description` field that should contain extra info, if it exists, but the other way around you need to add the extra info? For example: from A -> B, you need to append the extra info, but when reading from B -> A, the extra info should never be on A.
balexandre
+3  A: 

This is essentially a contract between the applications that produce strings in this format, and those that consume them - use whatever is appropriate for your situation.

You might want to consider whether flattening multiple strings into a single giant string is necessary in the first place. If their reason for existence is solely to represent 'separated' textual data within your application, you might want to produce the data as a sequence of strings (a string[] for example) right from the beginning. In this case, no 'parsing' will be necessary.

If, on the other hand, the data must be persisted and consumed at a later point, there are several options. For example:

  1. Database: Store each string as a row in a database table. No splitting is required.
  2. Designated Delimiter: Store the strings in a flat-file with a 'special' separator that signifies the end of the current string. Obviously, this character must be such that it can't be part of a legal sub-string. E.g. If your strings can't contain a pipe-character as you say, then this is a reasonable choice for a delimiter.
  3. Escape-sequences: E.g. * is the separator, ** represents an asterisk within a string. This will mean that no character is reserved for use as a sentinel (rendering it unrepresentable). On the downside, parsing becomes a non-trivial task.
  4. Purpose-built format: E.g. XML. When you consider that this requires that certain characters be 'escaped', this is essentially an extension of point 3, but the problem has now been punted to your XML libraries.
Ani
+1, very thoughtful answer
John M Gant
@John: Why? Because it uses lots of buzzwords?
Timwi
@Timwi: Thanks for your criticism. Could you tell me which parts are unclear or use 'handwaving' buzzwords? That is not the intention at any rate, and I would like the meaning to be clearly communicated. Thanks.
Ani
@Ani: Hehe, sorry if I hurt any feelings... there is nothing wrong with your answer, my criticism was actually leveled at John.
Timwi
@Timwi: No feelings hurt. Obviously, the OP has a genuine question that deserves some genuine discussion, and I don't want to appear to be throwing buzzwords at a real problem. That's why I asked.
Ani
@Timwi, I don't see any buzzwords here, just a clear and well-reasoned answer. Maybe I'm a CIO and don't know it.
John M Gant
@Timwi, BTW, I upvoted your answer too, even without the buzzwords, right after I upvoted Ani's. I think you both make good points.
John M Gant
+7  A: 

It may be “safer” than the pipe because it is rarer, but both ways are suboptimal because they limit you to a subset of possible strings.

Consider using a proper encoding — one that unambiguously encodes a list of arbitrary strings. The simplest in terms of coding is probably to simply serialize a string[]. You could use BinaryFormatter or XmlSerializer or something else.
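As a rough illustration of the serialization route, here is a minimal sketch using XmlSerializer on a string[]. The class and method names (StringListXml, Serialize, Deserialize) are made up for this example, and it assumes the strings contain ordinary text; control characters such as (char)2 may need extra care with XML and are not covered here.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical helper: round-trips a string[] through an XML string.
public static class StringListXml
{
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(string[]));

    // Turns the array into a single XML string; the serializer escapes markup characters.
    public static string Serialize(string[] items)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, items);
            return writer.ToString();
        }
    }

    // Reads the XML string back into the original array.
    public static string[] Deserialize(string xml)
    {
        using (var reader = new StringReader(xml))
        {
            return (string[])Serializer.Deserialize(reader);
        }
    }
}
```

With this in place, no separator character has to be reserved: values such as "a|b" or "c,d" come back unchanged from StringListXml.Deserialize(StringListXml.Serialize(values)).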

If the result has to be a string, and it has to be a short one, then you could try something like this:

  • Encoding: (list of strings to single string)
    • Replace every ! with !e and every | with !p in every string. Now, none of the strings contains a | and you can easily reverse this.
    • Concatenate the strings using | as a separator.
  • Decoding: (single string back to list of strings)
    • Split on the | character.
    • Replace all !p with | and !e with ! in every string. This recovers the original strings.
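
The steps above could be sketched in C# roughly as follows. The class name PipeListCodec and its method names are invented for illustration. The order of the replacements matters: when encoding, ! is escaped before |, and when decoding, !p is restored before !e.

```csharp
using System;
using System.Linq;

// Hypothetical helper implementing the !e / !p escaping scheme described above.
public static class PipeListCodec
{
    // Escape '!' first, then '|', so the '!' introduced by "!p" is not escaped a second time.
    // After this step no item contains a literal '|', so '|' is safe as the separator.
    public static string Encode(params string[] items)
    {
        return string.Join("|", items.Select(s => s.Replace("!", "!e").Replace("|", "!p")));
    }

    // Split on '|', then undo the escapes in reverse order: "!p" -> "|" first, "!e" -> "!" last.
    // Limitation of this sketch: an empty list and a list holding one empty string
    // both encode to "", so they cannot be told apart after decoding.
    public static string[] Decode(string encoded)
    {
        return encoded.Split('|')
            .Select(s => s.Replace("!p", "|").Replace("!e", "!"))
            .ToArray();
    }
}
```

For example, PipeListCodec.Encode("a|b", "c!") gives "a!pb|c!e", and passing that to PipeListCodec.Decode returns the two original strings.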
Timwi
What if original string has "!p"?
Grozz
@Grozz: That would have gotten changed to `!!p` and would still be unambiguous. But I see now that if the replacement is done in two separate passes instead of one, there would be a bug, so to make it easier to implement I changed `!!` to `!e`. Happy?
Timwi
+1 for `string[]` and `BinaryFormatter`, which avoid the problem entirely by using a built-in structure for handling this issue.
Brian
+2  A: 

I think using non-printable characters is more obscure than safe. If you want safety, a solution would be to serialize/deserialize your List<string>.

hoang
+1  A: 

You can go for a normal CSV reader/writer. This helps you because when a value has the separator, it's enclosed in double quotes:

a,b,"c,d"

produces:

new[] { "a", "b", "c,d" }

This may help http://www.codeproject.com/KB/database/CsvReader.aspx

Pieter
A: 

It depends on the expected content of the strings. If they could contain non-printable characters, then maybe not. The other way is to escape the strings that you are going to split. It looks like more work, but it can be put into a reusable helper:

var string1 = "string|1";
var string2 = "string |2";
var string3 = "string| 3";
var string4 = "string | 4";

var stringToSplit = MergeStrings(string1, string2, string3, string4);

var results = SplitString( stringToSplit );

foreach(string result in results)
{
    Trace.WriteLine( result );
}

Which uses the following methods.

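// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Text;

// Escapes every "|" as "||" and appends the separator " | " after each string.
public string MergeStrings(params string[] strings)
{
    var stringBuilder = new StringBuilder();

    foreach(var s in strings)
    {
        stringBuilder.Append( s.Replace( "|", "||" ) );
        stringBuilder.Append( " | " );
    }

    return stringBuilder.ToString();
}

// Splits on the separator " | " and turns each "||" back into "|".
// Note: RemoveEmptyEntries discards the trailing empty entry, but it also drops any empty input strings.
public IEnumerable<string> SplitString(string stringToSplit)
{
    var results = stringToSplit.Split( new[] { " | " }, StringSplitOptions.RemoveEmptyEntries );

    return results.Select( result => result.Replace( "||", "|" ) );
}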

You would probably want to make the separator character customizable.

Bronumski