I'd like to write an extension method for the .NET String class. I'd like it to be a special variation on the Split method: one that takes an escape character and does not split the string where that escape character appears immediately before the separator.

What's the best way to write this? I'm curious about the best non-regex way to approach it.
Something with a signature like...

public static string[] Split(this string input, string separator, char escapeCharacter)
{
   // ...
}

UPDATE: Because it came up in one of the comments, the escaping...

In C# when escaping non-special characters you get the error - CS1009: Unrecognized escape sequence.

In IE JScript the escape characters are thrown out, unless you try \u, in which case you get an "Expected hexadecimal digit" error. I tested Firefox and it has the same behavior.

I'd like this method to be pretty forgiving and follow the JavaScript model. If you escape on a non-separator it should just "kindly" remove the escape character.

+1  A: 

This will need to be cleaned up a bit, but this is essentially it....

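// Assumes a single-character separator (char), since input[i] is a char.
List<string> output = new List<string>();
int j = 0; // start of the current segment
for (int i = 0; i < input.Length; ++i)
{
    if (input[i] == separator && (i == 0 || input[i - 1] != escapeChar))
    {
        output.Add(input.Substring(j, i - j));
        j = i + 1; // skip past the separator
    }
}
output.Add(input.Substring(j)); // trailing segment after the last separator

return output.ToArray();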
James Curran
+1  A: 

The signature is incorrect, you need to return a string array

WARNING: I've never used extension methods, so forgive any errors ;)

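public static List<String> Split(this string input, string separator, char escapeCharacter)
{
    String word = "";
    List<String> result = new List<string>();
    for (int i = 0; i < input.Length; i++)
    {
        // can also use switch
        if (input[i] == escapeCharacter && i + 1 < input.Length)
        {
            // Drop the escape character and keep the next character literally.
            word += input[++i];
        }
        else if (separator.Length > 0
                 && string.CompareOrdinal(input, i, separator, 0, separator.Length) == 0)
        {
            result.Add(word);
            word = "";
            i += separator.Length - 1; // skip the rest of the separator
        }
        else
        {
            word += input[i];
        }
    }
    result.Add(word); // last segment
    return result;
}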
PoweRoy
nice catch. I'll go fix that in the original question.
tyndall
+3  A: 

How about:

public static IEnumerable<string> Split(this string input, 
                                        string separator,
                                        char escapeCharacter)
{
    int startOfSegment = 0;
    int index = 0;
    while (index < input.Length)
    {
        index = input.IndexOf(separator, index);
        if (index > 0 && input[index-1] == escapeCharacter)
        {
            index += separator.Length;
            continue;
        }
        if (index == -1)
        {
            break;
        }
        yield return input.Substring(startOfSegment, index-startOfSegment);
        index += separator.Length;
        startOfSegment = index;
    }
    yield return input.Substring(startOfSegment);
}

That seems to work (with a few quick test strings), but it doesn't remove the escape character - that will depend on your exact situation, I suspect.

Jon Skeet
It looks like you're assuming that anytime the escape character appears it's followed by the separator string. What if it isn't?
tvanfosson
I'm only going on what's in the question - if the escape character appears before the separator, it should prevent that separator from being used for splitting. I don't try to remove the escape character or process it in any other way. Naive, perhaps, but that's all the information we've got.
Jon Skeet
cool, what is the benefit of IEnumerable over returning a string array?
rizzle
Deferred execution and streaming - we don't need to buffer everything up.
Jon Skeet
Jon, updated the question (top) to include the escape removal question. Never thought of the "yield" strategy... interesting. +1
tyndall
@Jon -- I'm thinking that escape character semantics are reasonably well known and an extension method ought to work within those semantics. Just my preference.
tvanfosson
@tvanfosson: In my experience escape character semantics vary considerably. Should it translate \n into a linefeed, for example? That's way beyond the scope of a *splitting* method, IMO.
Jon Skeet
@Bruno: I would handle unescaping in a separate method, particularly if the escape character is going to be used for more than just "don't escape the separator". It can get quite involved. Having said that, if the escape character escapes itself, it could get tricky. e.g. "foo\\,bar" is "foo\" "bar"
Jon Skeet
(Assuming a '\' escape character and a "," separator.)
Jon Skeet
I'm a little green on parsing, but shouldn't the escape character put the "state" into a special mode for one character only? Then once you pass this one character, you return to regular mode. Then \\, situations are not that tricky: \\ would turn into \ and the separator , would be processed.
tyndall
Thanks for all the input. I might consider the unescaping in a separate method. Especially, if it makes the code more readable/maintainable.
tyndall
@Bruno: Your "state" comment is right, if an escape character can escape itself. Basically it will all depend on what your escaping requirements are.
Jon Skeet
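
The "state" idea discussed in the comments above can be sketched as a single pass over the input. This is only a minimal sketch under assumptions: the method name SplitEscaped and the class name are hypothetical, an escape character applies to the separator as a whole if one follows and otherwise to the next single character, the escape character is always removed (the forgiving model the question asks for), and a trailing escape character with nothing after it is kept as-is.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EscapedSplitExtensions
{
    // Hypothetical helper: splits on unescaped separators and drops escape characters.
    public static string[] SplitEscaped(this string input, string separator, char escapeCharacter)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (string.IsNullOrEmpty(separator))
            throw new ArgumentException("Separator must be non-empty.", nameof(separator));

        var result = new List<string>();
        var current = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            if (input[i] == escapeCharacter && i + 1 < input.Length)
            {
                i++; // "escaped" mode: drop the escape character itself
                if (string.CompareOrdinal(input, i, separator, 0, separator.Length) == 0)
                {
                    current.Append(separator); // escaped separator stays in the segment
                    i += separator.Length;
                }
                else
                {
                    current.Append(input[i]); // any other escaped character is kept literally
                    i++;
                }
            }
            else if (string.CompareOrdinal(input, i, separator, 0, separator.Length) == 0)
            {
                result.Add(current.ToString()); // unescaped separator ends the segment
                current.Clear();
                i += separator.Length;
            }
            else
            {
                current.Append(input[i]);
                i++;
            }
        }
        result.Add(current.ToString()); // trailing segment
        return result.ToArray();
    }
}
```

With a '\' escape character and a "," separator, the input a,b\,c yields "a" and "b,c", and foo\\,bar yields "foo\" and "bar", which matches the example given in the comments.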
+2  A: 

My first observation is that the separator ought to be a char not a string since escaping a string using a single character may be hard -- how much of the following string does the escape character cover? Other than that, @James Curran's answer is pretty much how I would handle it - though, as he says it needs some clean up. Initializing j to 0 in the loop initializer, for instance. Figuring out how to handle null inputs, etc.

You probably also want to support StringSplitOptions, so the caller can specify whether empty strings should be returned in the collection.

tvanfosson
+1 All good points
tyndall
+1  A: 

Personally I'd cheat and have a peek at string.Split using reflector... InternalSplitOmitEmptyEntries looks useful ;-)

Si
+2  A: 
        public static string[] Split(this string input, string separator, char escapeCharacter)
        {
            Guid g = Guid.NewGuid();
            input = input.Replace(escapeCharacter.ToString() + separator, g.ToString());
            string[] result = input.Split(new string []{separator}, StringSplitOptions.None);
            for (int i = 0; i < result.Length; i++)
            {
                result[i] = result[i].Replace(g.ToString(), escapeCharacter.ToString() + separator);
            }

            return result;
        }

Probably not the best way of doing it, but it's another alternative. Basically, everywhere the sequence of escape+separator is found, replace it with a GUID (you can use any other random crap in here, doesn't matter). Then use the built-in Split function. Then replace the GUID in each element of the array with the escape+separator.

BFree
After the split call, wouldn't you replace g with just the separator and not include the escape? That would save you the trouble of having to remove the escape from the returned string.
rjrapson
This is the classic "placeholder" pattern. I like the use of the GUID as the placeholder. I would say that this is good enough for "hobby" code, but not "Global Thermonuclear War" code.
tyndall
@rjrapson: Good point. I guess it depends on what the OP wanted. I guess you can extend this method to take a bool whether or not to include the escape character. @Bruno: The only real issue I see with this approach, is that a Guid includes a "-" which CAN be the separator.
BFree
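
The two points raised in these comments (restore only the separator, and avoid a placeholder that could contain the separator) might be combined roughly as follows. This is a sketch under assumptions, not a hardened implementation: the name SplitWithPlaceholder is hypothetical, the placeholder is a single character taken from the Unicode private use area that occurs in neither the input nor the separator, and an escaped escape character (such as \\,) is not treated specially.

```csharp
using System;

public static class PlaceholderSplitExtensions
{
    // Hypothetical variant of the placeholder approach that removes the escape character.
    public static string[] SplitWithPlaceholder(this string input, string separator, char escapeCharacter)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (string.IsNullOrEmpty(separator))
            throw new ArgumentException("Separator must be non-empty.", nameof(separator));

        // Look for a private use area character that cannot clash with the data.
        char placeholder = '\uE000';
        while (input.IndexOf(placeholder) >= 0
               || separator.IndexOf(placeholder) >= 0
               || placeholder == escapeCharacter)
        {
            if (placeholder == '\uF8FF')
                throw new InvalidOperationException("No free placeholder character.");
            placeholder++;
        }
        string token = placeholder.ToString();

        // Hide escaped separators, split on the remaining ones, then restore
        // only the separator so the escape character is dropped.
        string[] result = input.Replace(escapeCharacter + separator, token)
                               .Split(new[] { separator }, StringSplitOptions.None);
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = result[i].Replace(token, separator);
        }
        return result;
    }
}
```

Unlike a GUID, the single-character placeholder is checked against the actual input, so a "-" separator is safe: x-y\-z split on "-" gives "x" and "y-z".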
+1  A: 

Here is a solution if you want to remove the escape character.

public static IEnumerable<string> Split(this string input, 
                                        string separator, 
                                        char escapeCharacter) {
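    // string.Split has no overload taking only a string[], so pass the options explicitly.
    string[] splitted = input.Split(new[] { separator }, StringSplitOptions.None);
    StringBuilder sb = null;

    for (int i = 0; i < splitted.Length; i++) {
        string subString = splitted[i];
        // A trailing escape only protects a separator if another piece follows it.
        bool escaped = i < splitted.Length - 1
                       && subString.EndsWith(escapeCharacter.ToString());
        if (escaped) {
            if (sb == null)
                sb = new StringBuilder();
            // Drop the escape character but keep the separator it protected.
            sb.Append(subString, 0, subString.Length - 1).Append(separator);
        } else {
            if (sb == null)
                yield return subString;
            else {
                sb.Append(subString);
                yield return sb.ToString();
                sb = null;
            }
        }
    }
}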
chaowman
A: 
public string RemoveMultipleDelimiters(string sSingleLine)
        {
            string sMultipleDelimitersLine = "";
            string sMultipleDelimitersLine1 = "";
            int iDelimeterPosition = -1;
            iDelimeterPosition = sSingleLine.IndexOf('>');
            iDelimeterPosition = sSingleLine.IndexOf('>', iDelimeterPosition + 1);
            if (iDelimeterPosition > -1)
            {
                sMultipleDelimitersLine = sSingleLine.Substring(0, iDelimeterPosition - 1);
                sMultipleDelimitersLine1 = sSingleLine.Substring(sSingleLine.IndexOf('>', iDelimeterPosition) - 1);
                sMultipleDelimitersLine1 = sMultipleDelimitersLine1.Replace('>', '*');
                sSingleLine = sMultipleDelimitersLine + sMultipleDelimitersLine1;
            }
            return sSingleLine;
        }