views:

116

answers:

4

(My apologies, this is a 2nd post to http://stackoverflow.com/questions/3390175/most-efficient-way-to-determine-if-a-string-length-0 but I can't figure out how to reply to people's answers, my reply becomes posted as an 'answer')

Ideally, what I'm looking for is the most efficient algorithm to do the following (which will be called 100million+ times). I'm using C# 4.0

Turn the string: "A B C D E " into the array: string["A","B","C","D","E"]

My algorithm is as follows:

public string[] SplitOnMultiSpaces(string text)
{
  if (string.IsNullOrEmpty(text)) return new string[0];

  var split = text.Split(' ');
  int length = split.Length;

  var data = new string[length];

  int index = 0;
  for (int i = 0; i<length; i++)
  {
    if (split[i].Length != 0)
    {
      data[index++] = split[i];
    }
  }

  return data;
}

My problem is when I profile this against 100,000 strings, it takes 1.04 seconds to execute.

If I comment out the "if (split[i].Length != 0)" check, it takes only 0.2 seconds.

Can anybody tell me why this (simple) query against the string is taking 80% of the TOTAL execution time? (Especially, since I expected other areas to use more CPU) The only idea I've come up w/ is C# is trying to count the string length, which people tell me is not the case (that it's more like VB strings I guess?). But that wouldn't make sense for the time overhead.

I've considered trying to see if split[i][0] exists, but relying on an exception slows things WAAAAAAY down.

P.S. -- My algorithm also suffers in that the returned array is, more often than not, bigger than it needs to be, but that doesn't seem to be too much of an overhead.

+3  A: 

Have compared performance using the String.Split overload that takes a StringSplitOptions that would make your empty string check unnecessary?

Nicole Calinoiu
+4  A: 

Likely to be as fast or faster than what you can do (without going into lower level code aka. C/C++).

// somewhere else
private static readonly char[] splitter =  new []{' '} ;

//
public string[] SplitOnMultiSpaces(string text)
{
    return text.Split(splitter, StringSplitOptions.RemoveEmptyEntries );
}
BioBuckyBall
+1 for directly returning the result of Split and saving redundant memory allocations and copies. Suggestion: make the `new [] {' '}` a static array instead of 'newing' it every time.
Moron
@Moron good suggestion.
BioBuckyBall
A: 

Incredible. The StringSplitOptions.RemoveEmptyEntries works perfect!

Thanks! Much, much faster now.

Eric
Eric: You should just mark the appropriate (best) answer as the answer (and maybe post a comment) instead of making a separate answer for this. Just hit the check mark near the "best" answer.
Reed Copsey
#meta: This is where you should be clicking "add comment" beneath another response.
roufamatic
A: 

When I benchmark this, in either debug or release mode, I get virtually identical runtimes regardless of whether "if (split[i].Length!=0)" is there or not, both corresponding to your quickest times. (Thus supporting the idea that Length is a quick check.) Is there something not shown that could be affecting the performance in some other way?

Having said that, I'd like to agree that StringSplitOptions.RemoveEmptyEntries is the best way to go. But I'm still curious as to why I can't reproduce the original behaviour.

TechNeilogy