views:

599

answers:

8

There are many similar questions, but apparently no perfect match, that's why I'm asking.

I'd like to split a random string (e.g. 123xx456yy789) by a list of string delimiters (e.g. xx, yy) and include the delimiters in the result (here: 123, xx, 456, yy, 789).

Good performance is a nice bonus. Regex should be avoided, if possible.

Update: I did some performance checks and compared the results (too lazy to formally check them though). The tested solutions are (in random order):

  1. Gabe
  2. Guffa
  3. Mafu
  4. Regex

Other solutions were not tested because either they were similar to another solution or they came in too late.

This is the test code:

class Program
{
    private static readonly List<Func<string, List<string>, List<string>>> Functions;
    private static readonly List<string> Sources;
    private static readonly List<List<string>> Delimiters;

    static Program ()
    {
        Functions = new List<Func<string, List<string>, List<string>>> ();
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Gabe (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Guffa (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Naive (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Regex (l).ToList ());

        Sources = new List<string> ();
        Sources.Add ("");
        Sources.Add (Guid.NewGuid ().ToString ());

        string str = "";
        for (int outer = 0; outer < 10; outer++) {
            for (int i = 0; i < 10; i++) {
                str += i + "**" + DateTime.UtcNow.Ticks;
            }
            str += "-";
        }
        Sources.Add (str);

        Delimiters = new List<List<string>> ();
        Delimiters.Add (new List<string> () { });
        Delimiters.Add (new List<string> () { "-" });
        Delimiters.Add (new List<string> () { "**" });
        Delimiters.Add (new List<string> () { "-", "**" });
    }

    private class Result
    {
        public readonly int FuncID;
        public readonly int SrcID;
        public readonly int DelimID;
        public readonly long Milliseconds;
        public readonly List<string> Output;

        public Result (int funcID, int srcID, int delimID, long milliseconds, List<string> output)
        {
            FuncID = funcID;
            SrcID = srcID;
            DelimID = delimID;
            Milliseconds = milliseconds;
            Output = output;
        }

        public void Print ()
        {
            Console.WriteLine ("S " + SrcID + "\tD " + DelimID + "\tF " + FuncID + "\t" + Milliseconds + "ms");
            Console.WriteLine (Output.Count + "\t" + string.Join (" ", Output.Take (10).Select (x => x.Length < 15 ? x : x.Substring (0, 15) + "...").ToArray ()));
        }
    }

    static void Main (string[] args)
    {
        var results = new List<Result> ();

        for (int srcID = 0; srcID < 3; srcID++) {
            for (int delimID = 0; delimID < 4; delimID++) {
                for (int funcId = 3; funcId >= 0; funcId--) { // i tried various orders in my tests
                    Stopwatch sw = new Stopwatch ();
                    sw.Start ();

                    var func = Functions[funcId];
                    var src = Sources[srcID];
                    var del = Delimiters[delimID];

                    for (int i = 0; i < 10000; i++) {
                        func (src, del);
                    }
                    var list = func (src, del);
                    sw.Stop ();

                    var res = new Result (funcId, srcID, delimID, sw.ElapsedMilliseconds, list);
                    results.Add (res);
                    res.Print ();
                }
            }
        }
    }
}

As you can see, it was really just a quick and dirty test, but I ran the test multiple times and with different order and the result was always very consistent. The measured time frames are in the range of milliseconds up to seconds for the larger datasets. I ignored the values in the low-millisecond range in my following evaluation because they seemed negligible in practice. Here's the output on my box:

S 0     D 0     F 3     11ms
1
S 0     D 0     F 2     7ms
1
S 0     D 0     F 1     6ms
1
S 0     D 0     F 0     4ms
0
S 0     D 1     F 3     28ms
1
S 0     D 1     F 2     8ms
1
S 0     D 1     F 1     7ms
1
S 0     D 1     F 0     3ms
0
S 0     D 2     F 3     30ms
1
S 0     D 2     F 2     8ms
1
S 0     D 2     F 1     6ms
1
S 0     D 2     F 0     3ms
0
S 0     D 3     F 3     30ms
1
S 0     D 3     F 2     10ms
1
S 0     D 3     F 1     8ms
1
S 0     D 3     F 0     3ms
0
S 1     D 0     F 3     9ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 2     6ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 1     5ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 0     5ms
1       9e5282ec-e2a2-4...
S 1     D 1     F 3     63ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 2     37ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 1     29ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 0     22ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 2     F 3     30ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 2     10ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 1     10ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 0     12ms
1       9e5282ec-e2a2-4...
S 1     D 3     F 3     73ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 2     40ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 1     33ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 0     30ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 2     D 0     F 3     10ms
1       0**634226552821...
S 2     D 0     F 2     109ms
1       0**634226552821...
S 2     D 0     F 1     5ms
1       0**634226552821...
S 2     D 0     F 0     127ms
1       0**634226552821...
S 2     D 1     F 3     184ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226
552821... - 0**634226552821... -
S 2     D 1     F 2     364ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226
552821... - 0**634226552821... -
S 2     D 1     F 1     134ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226
552821... - 0**634226552821... -
S 2     D 1     F 0     517ms
20      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226
552821... - 0**634226552821... -
S 2     D 2     F 3     688ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 2     F 2     2404ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 2     F 1     874ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 2     F 0     717ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 3     F 3     1205ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 3     F 2     3471ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 3     F 1     1008ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **
S 2     D 3     F 0     1095ms
220     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 6
34226552821217... **

I compared the results and this is what I found:

  • All 4 functions are fast enough for common usage.
  • The naive version (aka what I wrote initially) is the worst in terms of computation time.
  • Regex is a bit slow on small datasets (probably due to initialization overhead).
  • Regex does well on large data and hits a similar speed as the non-regex solutions.
  • The performance-wise best seems to be Guffa's version overall, which is to be expected from the code.
  • Gabe's version sometimes omits an item, but I did not investigate this (bug?).

To conclude this topic, I suggest to use Regex, which is reasonably fast. If performance is critical, I'd prefer Guffa's implementation.

+5  A: 

Ok, sorry, maybe this one:

    string source = "123xx456yy789";
    foreach (string delimiter in delimiters)
        source = source.Replace(delimiter, ";" + delimiter + ";");
    string[] parts = source.Split(';');
Nagg
Fails for delimiters that include `;`.
mafutrct
@mafutrct - he actually presented a workable idea, though. Perhaps have a list of possible *new* delimitters, could be one or more characters each. Iterate over the list, check if the possible delimitter exists, and use Nagg's logic for the first delimitter that passes the test.
Anthony Pegram
True, but I'd really like a solution that is not dependent on the non-existance of certain delimiter literals in the string. I don't see how this is possible with this idea, except with some mapping that would likely hurt the performance too badly. I'm open for counter examples though, of course.
mafutrct
+1  A: 

A naive implementation

public IEnumerable<string> SplitX (string text, string[] delimiters)
{
    var split = text.Split (delimiters, StringSplitOptions.None);

    foreach (string part in split) {
        yield return part;
        text = text.Substring (part.Length);

        string delim = delimiters.FirstOrDefault (x => text.StartsWith (x));
        if (delim != null) {
            yield return delim;
            text = text.Substring (delim.Length);
        }
    }
}
mafutrct
+12  A: 

Despite your reluctance to use regex it actually nicely preserves the delimiters by using a group along with the Regex.Split method:

string input = "123xx456yy789";
string pattern = "(xx|yy)";
string[] result = Regex.Split(input, pattern);

If you remove the parentheses from the pattern, using just "xx|yy", the delimiters are not preserved. Be sure to use Regex.Escape on the pattern if you use any metacharacters that hold special meaning in regex. The characters include \, *, +, ?, |, {, [, (,), ^, $,., #. For instance, a delimiter of . should be escaped \.. Given a list of delimiters, you need to "OR" them using the pipe | symbol and that too is a character that gets escaped. To properly build the pattern use the following code (thanks to @gabe for pointing this out):

var delimiters = new List<string> { ".", "xx", "yy" };
string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                  + ")";

The parentheses are concatenated rather than included in the pattern since they would be incorrectly escaped for your purposes.

EDIT: In addition, if the delimiters list happens to be empty, the final pattern would incorrectly be () and this would cause blank matches. To prevent this a check for the delimiters can be used. With all this in mind the snippet becomes:

string input = "123xx456yy789";
// to reach the else branch set delimiters to new List();
var delimiters = new List<string> { ".", "xx", "yy", "()" }; 
if (delimiters.Count > 0)
{
    string pattern = "("
                     + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                     + ")";
    string[] result = Regex.Split(input, pattern);
    foreach (string s in result)
    {
        Console.WriteLine(s);
    }
}
else
{
    // nothing to split
    Console.WriteLine(input);
}

If you need a case-insensitive match for the delimiters use the RegexOptions.IgnoreCase option: Regex.Split(input, pattern, RegexOptions.IgnoreCase)

Ahmad Mageed
+1 That's a nice solution. I do like regex, I just thought it's too big of a tool for a job so simple that a very similar version was included in .NET's string class.
mafutrct
You would need to do `pattern = "(" + String.Join("|", (from d in delimeters select Regex.Escape(d)).ToArray()) + ")"` because any of the delimeters could have a `.` or `|` or whatever in them.
Gabe
+1 I didn't know you could do that! Very nice. You just need to fix the Regex.Escape code...
Mark Byers
@gabe good point, I missed that. Will edit now.
Ahmad Mageed
@Mark thanks, and done :)
Ahmad Mageed
@Ahmad - accept an array of delimiters and use Regex.Escape before building the pattern group.
Sky Sanders
@Sky I'm not sure I follow your suggestion. Maybe you made it before seeing my last edit? Using an array would be similar to using a list; it would also need a `ToArray()` after the `Select` to go from an `IEnumerable<string>` to a `string[]` for `String.Join`.
Ahmad Mageed
@Ahmed, yeah, maybe it was getting edited as I browsed the thread or I just glossed over what you had. who knows. In any case, you have implemented my intention.
Sky Sanders
+1 - Usually I abhor RegEx. But your solution, along with the explanation as code comments, makes quite a bit of sense. Is really quite readable. And has clear intent, which is the key factor.
Metro Smurf
Tiny flaw: Requires check for existence of delimiters to avoid weird results.
mafutrct
@mafutrct can you give an example?
Ahmad Mageed
@Ahmed: pattern = "()"
mafutrct
@mafutrct: parentheses need to be escaped in regex patterns since they are used for grouping. It would need to be `pattern = @"\(\)";` or `pattern = Regex.Escape("()");` to get the same result. That will prevent any weirdness. Regex.Escape will escape: `\, *, +, ?, |, {, [, (,), ^, $,., #`. If the `delimiters` list is empty the final pattern build up will incorrectly be `()` so a check is needed to avoid splitting on an empty list: `if (delimiters.Count > 0) { // build pattern and then split, otherwise do nothing }`. That check is good to have in general.
Ahmad Mageed
@mafutrct updated my post with a new snippet that reflects the previously mentioned points.
Ahmad Mageed
I'm sorry, I was unclear. I specifically meant the case when pattern becomes "()" because no delimiters are given (the list of delimiters is empty). In this case, I guess the method should return a list containing exactly one element: the whole input string. This is really only a tiny detail about an unspecified edge case.
mafutrct
@mafutrct that's fine I accounted for that scenario as well. Please see the if/else snippet I added previously. This handles an empty delimiter list.
Ahmad Mageed
+2  A: 

Here's a solution that doesn't use a regular expression and doesn't make more strings than necessary:

public static List<string> Split(string searchStr, string[] separators)
{
    List<string> result = new List<string>();
    int length = searchStr.Length;
    int lastMatchEnd = 0;
    for (int i = 0; i < length; i++)
    {
        for (int j = 0; j < separators.Length; j++)
        {
            string str = separators[j];
            int sepLen = str.Length;
            if (((searchStr[i] == str[0]) && (sepLen <= (length - i))) && ((sepLen == 1) || (String.CompareOrdinal(searchStr, i, str, 0, sepLen) == 0)))
            {
                result.Add(searchStr.Substring(lastMatchEnd, i - lastMatchEnd));
                result.Add(separators[j]);
                i += sepLen - 1;
                lastMatchEnd = i + 1;
                break;
            }
        }
    }
    if (lastMatchEnd != length)
        result.Add(searchStr.Substring(lastMatchEnd));
    return result;
}
Gabe
I noticed this produces an output different from all others. Sometimes an item is missing, apparently.
mafutrct
A: 

Here's another method. I cannot vouch for its efficiency.

    static void Main()
    {
        string input = "123xx456yy789yy012";
        string[] delims = { "xx", "yy" };

        string[] array = input.Split(delims, StringSplitOptions.None);

        List<string> splitInput = new List<string>();
        StringBuilder builder = new StringBuilder(input);
        foreach (string val in array)
        {
            splitInput.Add(val);
            builder.Remove(0, val.Length);
            foreach (string delim in delims)
            {
                if (builder.ToString().StartsWith(delim))
                {
                    splitInput.Add(delim);
                    builder.Remove(0, delim.Length);
                    break;
                }
            }
        }

        foreach (string s in splitInput)
            Console.WriteLine(s);

        Console.Read();
    }

Edit: Slightly optimized it.

Anthony Pegram
+1  A: 

I came up with a solution for something similar a while back. To efficiently split a string you can keep a list of the next occurance of each delimiter. That way you minimise the times that you have to look for each delimiter.

This algorithm will perform well even for a long string and a large number of delimiters:

string input = "123xx456yy789";
string[] delimiters = { "xx", "yy" };

int[] nextPosition = delimiters.Select(d => input.IndexOf(d)).ToArray();
List<string> result = new List<string>();
int pos = 0;
while (true) {
  int firstPos = int.MaxValue;
  string delimiter = null;
  for (int i = 0; i < nextPosition.Length; i++) {
    if (nextPosition[i] != -1 && nextPosition[i] < firstPos) {
      firstPos = nextPosition[i];
      delimiter = delimiters[i];
    }
  }
  if (firstPos != int.MaxValue) {
    result.Add(input.Substring(pos, firstPos - pos));
    result.Add(delimiter);
    pos = firstPos + delimiter.Length;
    for (int i = 0; i < nextPosition.Length; i++) {
      if (nextPosition[i] != -1 && nextPosition[i] < pos) {
        nextPosition[i] = input.IndexOf(delimiters[i], pos);
      }
    }
  } else {
    result.Add(input.Substring(pos));
    break;
  }
}

(With reservations for any bugs, I just threw this version together now and I haven't tested it thorougly.)

Guffa
Seems to work fine for standard input.
mafutrct
A: 

This will have identical semantics to String.Split default mode (so not including empty tokens).

It can be made faster by using unsafe code to iterate over the source string, though this requires you to write the iteration mechanism yourself rather than using yield return. It allocates the absolute minimum (a substring per non separator token plus the wrapping enumerator) so realistically to improve performance you would have to:

  • use even more unsafe code (by using 'CompareOrdinal' I effectively am)
    • mainly in avoiding the overhead of character lookup on the string with a char buffer
  • make use of domain specific knowledge about the input sources or tokens.
    • you may be happy to eliminate the null check on the separators
    • you may know that the separators are almost never individual characters

The code is written as an extension method

public static IEnumerable<string> SplitWithTokens(
    string str,
    string[] separators)
{
    if (separators == null || separators.Length == 0)
    {
        yield return str;
        yield break;
    }
    int prev = 0;
    for (int i = 0; i < str.Length; i++)
    {
        foreach (var sep in separators)
        {
            if (!string.IsNullOrEmpty(sep))
            {
                if (((str[i] == sep[0]) && 
                          (sep.Length <= (str.Length - i))) 
                     &&
                    ((sep.Length == 1) || 
                    (string.CompareOrdinal(str, i, sep, 0, sep.Length) == 0)))
                {
                    if (i - prev != 0)
                        yield return str.Substring(prev, i - prev);
                    yield return sep;
                    i += sep.Length - 1;
                    prev = i + 1;
                    break;
                }
            }
        }
    }
    if (str.Length - prev > 0)
        yield return str.Substring(prev, str.Length - prev);
}
ShuggyCoUk
ah - realised I am similar to gabe in implementation. Mine saves some allocations but is fundamentally the same concept.
ShuggyCoUk
How does your implementation save allocations?
Gabe
@gabe I do not create sub strings for the separator tokens, a minor improvement trivial to add to yours (which I see you have already sone)
ShuggyCoUk
Yes, but your `foreach` loop allocates a new enumerator for the separator array for every character of the input string.
Gabe
@gabe foreach on a (compile time known) array does not allocate an enumerator. Try it and see.
ShuggyCoUk
Indeed, you're right. Cool!
Gabe
+1  A: 

My first post/answer...this is a recursive approach.

    static void Split(string src, string[] delims, ref List<string> final)
    {
        if (src.Length == 0)
            return;

        int endTrimIndex = src.Length;
        foreach (string delim in delims)
        {
            //get the index of the first occurance of this delim
            int indexOfDelim = src.IndexOf(delim);
            //check to see if this delim is at the begining of src
            if (indexOfDelim == 0)
            {
                endTrimIndex = delim.Length;
                break;
            }
            //see if this delim comes before previously searched delims
            else if (indexOfDelim < endTrimIndex && indexOfDelim != -1)
                endTrimIndex = indexOfDelim;
        }
        final.Add(src.Substring(0, endTrimIndex));
        Split(src.Remove(0, endTrimIndex), delims, ref final);
    }