ansaurus

Question

How do I split a string by strings and include the delimiters using .NET?

Answer 1

+5 A:

Ok, sorry, maybe this one:

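    string source = "123xx456yy789";
    string[] delimiters = { "xx", "yy" };
    // Wrap every delimiter in ';' so that a plain Split(';') returns it as its own element.
    foreach (string delimiter in delimiters)
        source = source.Replace(delimiter, ";" + delimiter + ";");
    string[] parts = source.Split(';');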

Nagg 2010-03-20 22:10:40

Fails for delimiters that include `;`.

mafutrct 2010-03-20 22:13:57

@mafutrct - he actually presented a workable idea, though. Perhaps have a list of possible *new* delimiters, each one or more characters long. Iterate over the list, check whether each candidate already occurs in the string, and use Nagg's logic with the first candidate that does not.

Anthony Pegram 2010-03-20 22:17:19

True, but I'd really like a solution that does not depend on the non-existence of certain delimiter literals in the string. I don't see how this is possible with this idea, except with some mapping that would likely hurt the performance too badly. I'm open to counterexamples though, of course.

mafutrct 2010-03-20 22:23:00
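
The idea discussed in these comments (pick a temporary separator that is guaranteed not to occur in the input) could be sketched roughly as follows. The names `SentinelSplit` and `FindUnusedChar`, and the choice of the Unicode private use area as the pool of candidates, are assumptions made for this illustration and are not part of the original answer. Delimiters that overlap or contain one another are still not handled correctly, and the extra scan of the input is the kind of cost mafutrct was worried about.

```csharp
using System;
using System.Linq;

public static class SentinelSplit
{
    // Hypothetical helper: returns a character from the Unicode private use
    // area that occurs neither in the source nor in any delimiter, or null.
    public static char? FindUnusedChar(string source, string[] delimiters)
    {
        for (char c = '\uE000'; c <= '\uF8FF'; c++)
        {
            char candidate = c; // local copy captured by the lambda below
            if (source.IndexOf(candidate) < 0 &&
                delimiters.All(d => d.IndexOf(candidate) < 0))
                return candidate;
        }
        return null;
    }

    public static string[] Split(string source, string[] delimiters)
    {
        char? sentinel = FindUnusedChar(source, delimiters);
        if (sentinel == null)
            throw new InvalidOperationException("No unused separator character available.");

        // Same logic as the answer, with the sentinel in place of ';'.
        string marker = sentinel.Value.ToString();
        foreach (string delimiter in delimiters)
            source = source.Replace(delimiter, marker + delimiter + marker);
        return source.Split(sentinel.Value);
    }
}
```

With this sketch, `SentinelSplit.Split("1;2xx3yy4", new[] { "xx", "yy" })` keeps the `;` inside the first element instead of splitting on it.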

Answer 2

+1 A:

A naive implementation

public IEnumerable<string> SplitX (string text, string[] delimiters)
{
    var split = text.Split (delimiters, StringSplitOptions.None);

    foreach (string part in split) {
        yield return part;
        text = text.Substring (part.Length);

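        // String.Split matches separators ordinally and ignores null or empty ones, so the lookup does the same.
        string delim = delimiters.FirstOrDefault (x => !string.IsNullOrEmpty (x) && text.StartsWith (x, StringComparison.Ordinal));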
        if (delim != null) {
            yield return delim;
            text = text.Substring (delim.Length);
        }
    }
}

mafutrct 2010-03-20 22:19:18

Answer 3

+12 A:

Despite your reluctance to use regex it actually nicely preserves the delimiters by using a group along with the Regex.Split method:

string input = "123xx456yy789";
string pattern = "(xx|yy)";
string[] result = Regex.Split(input, pattern);

If you remove the parentheses from the pattern, using just "xx|yy", the delimiters are not preserved. Be sure to use Regex.Escape on each delimiter if it contains any metacharacters that hold special meaning in regex. These characters include \, *, +, ?, |, {, [, (, ), ^, $, ., and #. For instance, a delimiter of . should be escaped as \. in the pattern. Given a list of delimiters, you need to "OR" them using the pipe | symbol, and that too is a character that gets escaped. To properly build the pattern, use the following code (thanks to @gabe for pointing this out):

var delimiters = new List<string> { ".", "xx", "yy" };
string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                  + ")";

The parentheses are concatenated after escaping rather than passed through Regex.Escape with the delimiters, since escaping them would turn the capturing group into literal ( and ) characters.
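
To make the effect of the capturing group and of escaping concrete, here is a small sketch. The class and method names (`CaptureGroupDemo`, `WithGroup`, `WithoutGroup`, `SplitOnDot`) are made up for this illustration; only `Regex.Split` and `Regex.Escape` come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureGroupDemo
{
    // Capturing group: the matched delimiters are included in the result.
    // "123xx456yy789" -> "123", "xx", "456", "yy", "789"
    public static string[] WithGroup(string input)
    {
        return Regex.Split(input, "(xx|yy)");
    }

    // No group: the delimiters are dropped.
    // "123xx456yy789" -> "123", "456", "789"
    public static string[] WithoutGroup(string input)
    {
        return Regex.Split(input, "xx|yy");
    }

    // The delimiter "." is escaped first, so it matches a literal dot only.
    // Unescaped, "(.)" would treat every character as a delimiter.
    // "a.b" -> "a", ".", "b"
    public static string[] SplitOnDot(string input)
    {
        return Regex.Split(input, "(" + Regex.Escape(".") + ")");
    }
}
```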

EDIT: In addition, if the delimiters list happens to be empty, the final pattern would incorrectly be () and this would cause blank matches. To prevent this a check for the delimiters can be used. With all this in mind the snippet becomes:

string input = "123xx456yy789";
// to reach the else branch set delimiters to new List<string>();
var delimiters = new List<string> { ".", "xx", "yy", "()" };
if (delimiters.Count > 0)
{
    string pattern = "("
                     + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                     + ")";
    string[] result = Regex.Split(input, pattern);
    foreach (string s in result)
    {
        Console.WriteLine(s);
    }
}
else
{
    // nothing to split
    Console.WriteLine(input);
}

If you need a case-insensitive match for the delimiters use the RegexOptions.IgnoreCase option: Regex.Split(input, pattern, RegexOptions.IgnoreCase)
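
The pieces of this answer (escaping, the empty-list check, and the optional case-insensitive match) could be wrapped in one helper along these lines. `RegexSplitSketch` and `SplitKeepingDelimiters` are hypothetical names. Skipping null or empty delimiters and trying longer delimiters first (so that "xx" wins over "x" when both are given) are additional assumptions of this sketch, not something the answer itself does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitSketch
{
    public static string[] SplitKeepingDelimiters(
        string input, IEnumerable<string> delimiters, bool ignoreCase)
    {
        // Escape each delimiter on its own; longer ones go first because
        // regex alternation tries the alternatives from left to right.
        string[] escaped = delimiters
            .Where(d => !string.IsNullOrEmpty(d))
            .OrderByDescending(d => d.Length)
            .Select(d => Regex.Escape(d))
            .ToArray();

        // Nothing to split on: return the whole input as the only element.
        if (escaped.Length == 0)
            return new[] { input };

        string pattern = "(" + string.Join("|", escaped) + ")";
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.Split(input, pattern, options);
    }
}
```

Note that with `ignoreCase` set, the returned delimiter is the text as it appears in the input (for example "XX"), not the delimiter as it was passed in.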

Ahmad Mageed 2010-03-20 22:34:43

+1 That's a nice solution. I do like regex, I just thought it's too big of a tool for a job so simple that a very similar version was included in .NET's string class.

mafutrct 2010-03-20 22:39:36

You would need to do `pattern = "(" + String.Join("|", (from d in delimiters select Regex.Escape(d)).ToArray()) + ")"` because any of the delimiters could have a `.` or `|` or whatever in them.

Gabe 2010-03-20 22:41:51

+1 I didn't know you could do that! Very nice. You just need to fix the Regex.Escape code...

Mark Byers 2010-03-20 22:42:17

@gabe good point, I missed that. Will edit now.

Ahmad Mageed 2010-03-20 22:44:08

@Mark thanks, and done :)

Ahmad Mageed 2010-03-20 22:56:29

@Ahmad - accept an array of delimiters and use Regex.Escape before building the pattern group.

Sky Sanders 2010-03-20 23:10:18

@Sky I'm not sure I follow your suggestion. Maybe you made it before seeing my last edit? Using an array would be similar to using a list; it would also need a `ToArray()` after the `Select` to go from an `IEnumerable<string>` to a `string[]` for `String.Join`.

Ahmad Mageed 2010-03-21 01:32:26

@Ahmed, yeah, maybe it was getting edited as I browsed the thread or I just glossed over what you had. who knows. In any case, you have implemented my intention.

Sky Sanders 2010-03-21 01:45:38

+1 - Usually I abhor RegEx. But your solution, along with the explanation as code comments, makes quite a bit of sense. Is really quite readable. And has clear intent, which is the key factor.

Metro Smurf 2010-03-21 17:07:47

Tiny flaw: Requires check for existence of delimiters to avoid weird results.

mafutrct 2010-03-21 21:38:16

@mafutrct can you give an example?

Ahmad Mageed 2010-03-21 22:13:55

@Ahmed: pattern = "()"

mafutrct 2010-03-22 09:22:03

@mafutrct: parentheses need to be escaped in regex patterns since they are used for grouping. It would need to be `pattern = @"\(\)";` or `pattern = Regex.Escape("()");` to get the same result. That will prevent any weirdness. Regex.Escape will escape: `\, *, +, ?, |, {, [, (, ), ^, $, ., #`. If the `delimiters` list is empty, the final pattern will incorrectly be `()`, so a check is needed to avoid splitting on an empty list: `if (delimiters.Count > 0) { // build pattern and then split, otherwise do nothing }`. That check is good to have in general.

Ahmad Mageed 2010-03-22 13:05:35

@mafutrct updated my post with a new snippet that reflects the previously mentioned points.

Ahmad Mageed 2010-03-22 13:30:58

I'm sorry, I was unclear. I specifically meant the case when pattern becomes "()" because no delimiters are given (the list of delimiters is empty). In this case, I guess the method should return a list containing exactly one element: the whole input string. This is really only a tiny detail about an unspecified edge case.

mafutrct 2010-03-22 14:11:07

@mafutrct that's fine I accounted for that scenario as well. Please see the if/else snippet I added previously. This handles an empty delimiter list.

Ahmad Mageed 2010-03-22 15:00:34

Answer 4

+2 A:

Here's a solution that doesn't use a regular expression and doesn't make more strings than necessary:

public static List<string> Split(string searchStr, string[] separators)
{
    List<string> result = new List<string>();
    int length = searchStr.Length;
    int lastMatchEnd = 0;
    for (int i = 0; i < length; i++)
    {
        for (int j = 0; j < separators.Length; j++)
        {
            string str = separators[j];
            int sepLen = str.Length;
            if (((searchStr[i] == str[0]) && (sepLen <= (length - i))) && ((sepLen == 1) || (String.CompareOrdinal(searchStr, i, str, 0, sepLen) == 0)))
            {
                result.Add(searchStr.Substring(lastMatchEnd, i - lastMatchEnd));
                result.Add(separators[j]);
                i += sepLen - 1;
                lastMatchEnd = i + 1;
                break;
            }
        }
    }
    if (lastMatchEnd != length)
        result.Add(searchStr.Substring(lastMatchEnd));
    return result;
}

Gabe 2010-03-20 22:35:16

I noticed this produces an output different from all others. Sometimes an item is missing, apparently.

mafutrct 2010-10-14 11:41:19
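
One difference that is easy to reproduce (it may or may not be the one this comment refers to) shows up when the input ends with a delimiter: the method above skips the trailing empty string, while `Regex.Split` with a capturing group returns it. The sketch below copies the answer's method unchanged into a hypothetical `TrailingDelimiterDemo` class so that the two can be compared side by side.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrailingDelimiterDemo
{
    // Copy of the Split method from this answer, behaviour unchanged.
    public static List<string> Split(string searchStr, string[] separators)
    {
        List<string> result = new List<string>();
        int length = searchStr.Length;
        int lastMatchEnd = 0;
        for (int i = 0; i < length; i++)
        {
            for (int j = 0; j < separators.Length; j++)
            {
                string str = separators[j];
                int sepLen = str.Length;
                if (searchStr[i] == str[0] && sepLen <= length - i &&
                    (sepLen == 1 || String.CompareOrdinal(searchStr, i, str, 0, sepLen) == 0))
                {
                    result.Add(searchStr.Substring(lastMatchEnd, i - lastMatchEnd));
                    result.Add(separators[j]);
                    i += sepLen - 1;
                    lastMatchEnd = i + 1;
                    break;
                }
            }
        }
        // The tail is only added when it is non-empty.
        if (lastMatchEnd != length)
            result.Add(searchStr.Substring(lastMatchEnd));
        return result;
    }

    // Regex-based split for comparison; keeps a trailing empty string.
    public static string[] RegexSplit(string input)
    {
        return Regex.Split(input, "(xx|yy)");
    }
}
```

For the input "123xx", `Split` returns "123", "xx", whereas `RegexSplit` returns "123", "xx", "". Replacing the final `if` with an unconditional `result.Add(searchStr.Substring(lastMatchEnd));` would make the two agree for this case.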

Answer 5

A:

Here's another method. I cannot vouch for its efficiency.

    static void Main()
    {
        string input = "123xx456yy789yy012";
        string[] delims = { "xx", "yy" };

        string[] array = input.Split(delims, StringSplitOptions.None);

        List<string> splitInput = new List<string>();
        StringBuilder builder = new StringBuilder(input);
        foreach (string val in array)
        {
            splitInput.Add(val);
            builder.Remove(0, val.Length);
            foreach (string delim in delims)
            {
                if (builder.ToString().StartsWith(delim))
                {
                    splitInput.Add(delim);
                    builder.Remove(0, delim.Length);
                    break;
                }
            }
        }

        foreach (string s in splitInput)
            Console.WriteLine(s);

        Console.Read();
    }

Edit: Slightly optimized it.

Anthony Pegram 2010-03-20 22:42:29

Answer 6

+1 A:

I came up with a solution for something similar a while back. To efficiently split a string you can keep a list of the next occurrence of each delimiter. That way you minimise the number of times that you have to search for each delimiter.

This algorithm will perform well even for a long string and a large number of delimiters:

string input = "123xx456yy789";
string[] delimiters = { "xx", "yy" };

int[] nextPosition = delimiters.Select(d => input.IndexOf(d)).ToArray();
List<string> result = new List<string>();
int pos = 0;
while (true) {
  int firstPos = int.MaxValue;
  string delimiter = null;
  for (int i = 0; i < nextPosition.Length; i++) {
    if (nextPosition[i] != -1 && nextPosition[i] < firstPos) {
      firstPos = nextPosition[i];
      delimiter = delimiters[i];
    }
  }
  if (firstPos != int.MaxValue) {
    result.Add(input.Substring(pos, firstPos - pos));
    result.Add(delimiter);
    pos = firstPos + delimiter.Length;
    for (int i = 0; i < nextPosition.Length; i++) {
      if (nextPosition[i] != -1 && nextPosition[i] < pos) {
        nextPosition[i] = input.IndexOf(delimiters[i], pos);
      }
    }
  } else {
    result.Add(input.Substring(pos));
    break;
  }
}

(With reservations for any bugs; I just threw this version together now and I haven't tested it thoroughly.)

Guffa 2010-03-20 23:17:40

Seems to work fine for standard input.

mafutrct 2010-03-21 21:41:03

Answer 7

A:

This behaves like String.Split with StringSplitOptions.RemoveEmptyEntries (so empty tokens are not returned), except that the separators themselves are returned as well.

It can be made faster by using unsafe code to iterate over the source string, though this requires you to write the iteration mechanism yourself rather than using yield return. It allocates the absolute minimum (a substring per non-separator token plus the wrapping enumerator), so realistically, to improve performance you would have to:

- use even more unsafe code (by using 'CompareOrdinal' I effectively am)
  - mainly in avoiding the overhead of character lookup on the string with a char buffer
- make use of domain-specific knowledge about the input sources or tokens
  - you may be happy to eliminate the null check on the separators
  - you may know that the separators are almost never individual characters

The code is written as an extension method:

// Extension methods must be declared inside a static class.
public static IEnumerable<string> SplitWithTokens(
    this string str,
    string[] separators)
{
    if (separators == null || separators.Length == 0)
    {
        yield return str;
        yield break;
    }
    int prev = 0;
    for (int i = 0; i < str.Length; i++)
    {
        foreach (var sep in separators)
        {
            if (!string.IsNullOrEmpty(sep))
            {
                if (((str[i] == sep[0]) && 
                          (sep.Length <= (str.Length - i))) 
                     &&
                    ((sep.Length == 1) || 
                    (string.CompareOrdinal(str, i, sep, 0, sep.Length) == 0)))
                {
                    if (i - prev != 0)
                        yield return str.Substring(prev, i - prev);
                    yield return sep;
                    i += sep.Length - 1;
                    prev = i + 1;
                    break;
                }
            }
        }
    }
    if (str.Length - prev > 0)
        yield return str.Substring(prev, str.Length - prev);
}

ShuggyCoUk 2010-03-21 01:23:42

ah - realised I am similar to gabe in implementation. Mine saves some allocations but is fundamentally the same concept.

ShuggyCoUk 2010-03-21 01:26:30

How does your implementation save allocations?

Gabe 2010-03-21 05:41:09

@gabe I do not create substrings for the separator tokens, a minor improvement that is trivial to add to yours (which I see you have already done)

ShuggyCoUk 2010-03-21 17:18:32

Yes, but your `foreach` loop allocates a new enumerator for the separator array for every character of the input string.

Gabe 2010-03-21 18:40:28

@gabe foreach on a (compile time known) array does not allocate an enumerator. Try it and see.

ShuggyCoUk 2010-03-21 22:36:12

Indeed, you're right. Cool!

Gabe 2010-03-22 02:05:07
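
The point settled in these comments can be illustrated with a sketch. When the compile-time type of the collection is an array, the C# compiler turns `foreach` into an index-based loop instead of calling `GetEnumerator()`. The second method below shows roughly the shape of the generated code; it is an approximation written for this illustration, not actual compiler output, and the class and method names are hypothetical.

```csharp
using System;

public static class ForeachLoweringSketch
{
    // foreach over a variable whose static type is string[].
    public static int TotalLengthWithForeach(string[] separators)
    {
        int total = 0;
        foreach (string sep in separators)
            total += sep.Length;
        return total;
    }

    // Approximately what the loop above is compiled to:
    // no enumerator object, just an index into the array.
    public static int TotalLengthWithFor(string[] separators)
    {
        int total = 0;
        string[] array = separators;
        for (int i = 0; i < array.Length; i++)
        {
            string sep = array[i];
            total += sep.Length;
        }
        return total;
    }
}
```

If the same array were passed around as `IEnumerable<string>`, the `foreach` would go through the interface and an enumerator would be allocated, which is presumably why the comment stresses "compile time known".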

Answer 8

+1 A:

My first post/answer...this is a recursive approach.

    static void Split(string src, string[] delims, ref List<string> final)
    {
        if (src.Length == 0)
            return;

        int endTrimIndex = src.Length;
        foreach (string delim in delims)
        {
            //get the index of the first occurrence of this delim
            int indexOfDelim = src.IndexOf(delim);
            //check to see if this delim is at the beginning of src
            if (indexOfDelim == 0)
            {
                endTrimIndex = delim.Length;
                break;
            }
            //see if this delim comes before previously searched delims
            else if (indexOfDelim < endTrimIndex && indexOfDelim != -1)
                endTrimIndex = indexOfDelim;
        }
        final.Add(src.Substring(0, endTrimIndex));
        Split(src.Remove(0, endTrimIndex), delims, ref final);
    }

2010-03-21 02:23:29
