There are many similar questions, but apparently no exact match, which is why I'm asking.

I'd like to split an arbitrary string (e.g. 123xx456yy789) by a list of string delimiters (e.g. xx, yy) and include the delimiters in the result (here: 123, xx, 456, yy, 789). Good performance is a nice bonus. Regex should be avoided, if possible.
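For illustration, here is a minimal non-regex sketch of the desired behavior, along the lines of my naive attempt (not the exact code that was benchmarked below): repeatedly find the earliest delimiter match, emit the preceding text, then emit the delimiter itself.

using System;
using System.Collections.Generic;

static class NaiveSplitSketch
{
    // Minimal sketch; assumes all delimiters are non-empty strings
    // (an empty delimiter would make the loop spin forever).
    public static IEnumerable<string> SplitIncludeDelimiters (this string input, IList<string> delimiters)
    {
        int pos = 0;
        while (pos < input.Length) {
            // Find the earliest match among all delimiters.
            int matchIndex = -1;
            string matchDelim = null;
            foreach (string d in delimiters) {
                int i = input.IndexOf (d, pos, StringComparison.Ordinal);
                if (i >= 0 && (matchIndex < 0 || i < matchIndex)) {
                    matchIndex = i;
                    matchDelim = d;
                }
            }
            if (matchIndex < 0) {
                yield return input.Substring (pos); // no further delimiter: rest of the string
                yield break;
            }
            if (matchIndex > pos) {
                yield return input.Substring (pos, matchIndex - pos); // text before the delimiter
            }
            yield return matchDelim; // the delimiter itself
            pos = matchIndex + matchDelim.Length;
        }
    }
}

With this, "123xx456yy789".SplitIncludeDelimiters (new[] { "xx", "yy" }) yields 123, xx, 456, yy, 789.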
Update: I did some performance checks and compared the results (too lazy to formally verify them, though). The tested solutions are (in random order):

- Gabe's version
- Guffa's version
- my naive version (what I initially wrote)
- a Regex-based version

Other solutions were not tested because they were either similar to another solution or came in too late.
This is the test code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    private static readonly List<Func<string, List<string>, List<string>>> Functions;
    private static readonly List<string> Sources;
    private static readonly List<List<string>> Delimiters;

    static Program ()
    {
        Functions = new List<Func<string, List<string>, List<string>>> ();
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Gabe (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Guffa (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Naive (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Regex (l).ToList ());

        // Three inputs: empty string, a GUID, and a large generated string.
        Sources = new List<string> ();
        Sources.Add ("");
        Sources.Add (Guid.NewGuid ().ToString ());
        string str = "";
        for (int outer = 0; outer < 10; outer++) {
            for (int i = 0; i < 10; i++) {
                str += i + "**" + DateTime.UtcNow.Ticks;
            }
            str += "-";
        }
        Sources.Add (str);

        // Four delimiter sets: none, "-", "**", and both.
        Delimiters = new List<List<string>> ();
        Delimiters.Add (new List<string> () { });
        Delimiters.Add (new List<string> () { "-" });
        Delimiters.Add (new List<string> () { "**" });
        Delimiters.Add (new List<string> () { "-", "**" });
    }

    private class Result
    {
        public readonly int FuncID;
        public readonly int SrcID;
        public readonly int DelimID;
        public readonly long Milliseconds;
        public readonly List<string> Output;

        public Result (int funcID, int srcID, int delimID, long milliseconds, List<string> output)
        {
            FuncID = funcID;
            SrcID = srcID;
            DelimID = delimID;
            Milliseconds = milliseconds;
            Output = output;
        }

        public void Print ()
        {
            Console.WriteLine ("S " + SrcID + "\tD " + DelimID + "\tF " + FuncID + "\t" + Milliseconds + "ms");
            Console.WriteLine (Output.Count + "\t" + string.Join (" ", Output.Take (10).Select (x => x.Length < 15 ? x : x.Substring (0, 15) + "...").ToArray ()));
        }
    }

    static void Main (string[] args)
    {
        var results = new List<Result> ();
        for (int srcID = 0; srcID < 3; srcID++) {
            for (int delimID = 0; delimID < 4; delimID++) {
                for (int funcId = 3; funcId >= 0; funcId--) { // I tried various orders in my tests
                    Stopwatch sw = new Stopwatch ();
                    sw.Start ();
                    var func = Functions[funcId];
                    var src = Sources[srcID];
                    var del = Delimiters[delimID];
                    for (int i = 0; i < 10000; i++) {
                        func (src, del);
                    }
                    var list = func (src, del);
                    sw.Stop ();
                    var res = new Result (funcId, srcID, delimID, sw.ElapsedMilliseconds, list);
                    results.Add (res);
                    res.Print ();
                }
            }
        }
    }
}
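For completeness: the harness assumes each candidate is exposed as a string extension method. Only the names are taken from the calls above; the actual bodies live in the respective answers. The expected shape is roughly this (anything returning something compatible with .ToList (), e.g. IEnumerable<string>, works):

static class SplitExtensions
{
    // Placeholder only; substitute the actual implementation from the answer.
    public static IEnumerable<string> SplitIncludeDelimiters_Gabe (this string s, List<string> delimiters)
    {
        throw new NotImplementedException ();
    }
    // SplitIncludeDelimiters_Guffa, _Naive and _Regex have the same signature.
}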
As you can see, it was really just a quick-and-dirty test, but I ran it multiple times and with different orderings, and the results were always very consistent. The measured times range from a few milliseconds up to seconds for the larger datasets. In the evaluation below I ignored the values in the low-millisecond range, since they seemed negligible in practice. Here's the output on my box:
S 0    D 0    F 3    11ms
1
S 0    D 0    F 2    7ms
1
S 0    D 0    F 1    6ms
1
S 0    D 0    F 0    4ms
0
S 0    D 1    F 3    28ms
1
S 0    D 1    F 2    8ms
1
S 0    D 1    F 1    7ms
1
S 0    D 1    F 0    3ms
0
S 0    D 2    F 3    30ms
1
S 0    D 2    F 2    8ms
1
S 0    D 2    F 1    6ms
1
S 0    D 2    F 0    3ms
0
S 0    D 3    F 3    30ms
1
S 0    D 3    F 2    10ms
1
S 0    D 3    F 1    8ms
1
S 0    D 3    F 0    3ms
0
S 1    D 0    F 3    9ms
1    9e5282ec-e2a2-4...
S 1    D 0    F 2    6ms
1    9e5282ec-e2a2-4...
S 1    D 0    F 1    5ms
1    9e5282ec-e2a2-4...
S 1    D 0    F 0    5ms
1    9e5282ec-e2a2-4...
S 1    D 1    F 3    63ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 1    F 2    37ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 1    F 1    29ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 1    F 0    22ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 2    F 3    30ms
1    9e5282ec-e2a2-4...
S 1    D 2    F 2    10ms
1    9e5282ec-e2a2-4...
S 1    D 2    F 1    10ms
1    9e5282ec-e2a2-4...
S 1    D 2    F 0    12ms
1    9e5282ec-e2a2-4...
S 1    D 3    F 3    73ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 3    F 2    40ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 3    F 1    33ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1    D 3    F 0    30ms
9    9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 2    D 0    F 3    10ms
1    0**634226552821...
S 2    D 0    F 2    109ms
1    0**634226552821...
S 2    D 0    F 1    5ms
1    0**634226552821...
S 2    D 0    F 0    127ms
1    0**634226552821...
S 2    D 1    F 3    184ms
21    0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2    D 1    F 2    364ms
21    0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2    D 1    F 1    134ms
21    0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2    D 1    F 0    517ms
20    0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2    D 2    F 3    688ms
201    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 2    F 2    2404ms
201    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 2    F 1    874ms
201    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 2    F 0    717ms
201    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 3    F 3    1205ms
221    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 3    F 2    3471ms
221    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 3    F 1    1008ms
221    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2    D 3    F 0    1095ms
220    0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
I compared the results and this is what I found:
- All four functions are fast enough for common usage.
- The naive version (i.e. what I initially wrote) has the worst computation time.
- Regex is a bit slow on small datasets (probably due to initialization overhead).
- Regex does well on large datasets and reaches a speed similar to the non-regex solutions.
- Overall, Guffa's version seems to perform best, which is to be expected from its code.
- Gabe's version sometimes returns one item fewer than the others (see the counts 0 vs. 1, 20 vs. 21, and 220 vs. 221 above); I did not investigate whether that is a bug.
To conclude this topic: I suggest using Regex, which is reasonably fast. If performance is critical, I'd prefer Guffa's implementation.
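For reference, the Regex approach can rely on a documented behavior of Regex.Split: when the pattern contains capturing groups, the captured delimiter text is included in the result. A minimal sketch (the exact implementation benchmarked above may differ):

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexSplitSketch
{
    static void Main ()
    {
        var delimiters = new[] { "xx", "yy" };
        // Escape each delimiter and wrap the alternation in a capturing
        // group so that Regex.Split keeps the matched delimiters.
        string pattern = "(" + string.Join ("|", delimiters.Select (Regex.Escape).ToArray ()) + ")";
        string[] parts = Regex.Split ("123xx456yy789", pattern);
        Console.WriteLine (string.Join (" ", parts)); // 123 xx 456 yy 789
    }
}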