I am not sure how to go about this. Right now I am counting the spaces to get the word count of my string but if there is a double space the word count will be inaccurate. Is there a better way to do this?

+2  A: 

This seems to work for me:

var input = "This is a  test";
var count = input.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Length;
Luke Quinane
Splitting has a huge overhead associated with it. A loop through the characters will be the fastest path. OTOH, Regex will likely be slower than Split, since Split only has to produce plain strings while Regex has to build a more complex Match object for each hit.
richardtallent
I call bullsh*t on richard's statement. It turns out Split is actually faster in some cases and only marginally slower than the foreach solution.
Sam Saffron
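Since the comments above disagree about which approach is faster, here is a minimal sketch of how one might time both on their own input. The class and method names (WordCountTiming, CountWithSplit, CountWithLoop, TimeMs) are made up for illustration and do not come from any poster's code; the loop only treats ' ' as a separator so that it matches the Split call above. Results will vary with machine, runtime, and string length, so no numbers are claimed here.

```csharp
using System;
using System.Diagnostics;

static class WordCountTiming
{
    // Split-based count, equivalent to the answer above.
    public static int CountWithSplit(string s)
    {
        return s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Single pass over the characters; a word starts at each
    // transition from "separator" to "non-separator".
    public static int CountWithLoop(string s)
    {
        int words = 0;
        bool inWord = false;
        foreach (char c in s)
        {
            if (c == ' ')
            {
                inWord = false;
            }
            else if (!inWord)
            {
                inWord = true;
                words++;
            }
        }
        return words;
    }

    // Calls the counter `iterations` times and returns the elapsed milliseconds.
    public static double TimeMs(Func<string, int> counter, string input, int iterations)
    {
        counter(input); // warm-up call so JIT compilation is not part of the measurement
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter(input);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Calling WordCountTiming.TimeMs(WordCountTiming.CountWithSplit, text, 100000) and the same with CountWithLoop on a representative string gives two numbers to compare. Run it in a release build and repeat it a few times before trusting the difference.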
+1  A: 

Try string.Split:

string sentence = "This     is a sentence     with  some spaces.";
string[] words = sentence.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
int wordCount = words.Length;
bobbymcr
+2  A: 

While solutions based on Split are short to write, they might get expensive, as all the string objects need to be created and then thrown away. I would expect that an explicit algorithm such as

  static int CountWords(string s)
  {
    int words = 0;
    bool inword = false;
    for(int i=0; i < s.Length; i++) {
      switch(s[i]) {
      case ' ':case '\t':case '\r':case '\n':
          if(inword)words++;
          inword = false;
          break;
      default:
          inword = true;
          break;
      }
    }
    if(inword)words++;
    return words;
  }

is more efficient (plus it can also consider additional whitespace characters).

Martin v. Löwis
If you're going that route, I would suggest a while loop, an index variable and string.IndexOfAny(char[], int).
bobbymcr
@bobbymcr I don't see how that would help. Presumably IndexOfAny is just linearly searching.
Robert Paulson
On my system, the IndexOfAny approach is indeed faster. Linear scan + switch = between 0.014 and 0.015 ms; while + IndexOfAny = between 0.011 and 0.012 ms. Code here: http://pastebin.ca/1551255 This hardly matters, but I wanted to be sure! :)
bobbymcr
Whoops, there was a bug in the previous code... corrected: http://pastebin.ca/1551268. Perf results are the same.
bobbymcr
@bobbymcr: I can't reproduce your results. The case version runs 0.008ms, the indexof version 0.016ms, using VS 2008. If I make the string much longer (e.g. doubling it 10 times), the indexof version consistently takes twice as long.
Martin v. Löwis
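For readers who cannot reach the pastebin links, here is a minimal sketch of what a while loop with an index variable and string.IndexOfAny(char[], int) could look like. This is not bobbymcr's actual code; the name CountWordsIndexOfAny and the choice of separators (the same four characters as the switch version) are assumptions made for illustration.

```csharp
using System;

static class IndexOfAnyWordCounter
{
    // Same separators as the switch-based version above.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    public static int CountWordsIndexOfAny(string s)
    {
        int words = 0;
        int start = 0;
        while (start < s.Length)
        {
            // Find the next separator at or after `start`.
            int end = s.IndexOfAny(Separators, start);
            if (end < 0)
            {
                end = s.Length; // no more separators: the rest is one token
            }
            if (end > start)
            {
                words++; // non-empty run of non-separator characters
            }
            start = end + 1; // skip past the separator
        }
        return words;
    }
}
```

Consecutive separators produce empty runs (end == start), which are skipped, so a double space does not inflate the count. Whether this beats the switch version seems to depend on the setup, as the conflicting timings above show.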
+4  A: 

An alternate version of @Martin v. Löwis's answer that uses foreach and char.IsWhiteSpace(), which recognizes Unicode whitespace beyond the four characters listed there and so should be more correct when dealing with other cultures.

int CountWithForeach(string para)
{
    bool inWord = false;
    int words = 0;
    foreach (char c in para)
    {
        if (char.IsWhiteSpace(c))
        {
            // Leaving a word: count it once.
            if (inWord)
                words++;
            inWord = false;
            continue;
        }
        inWord = true;
    }
    // Count a final word that runs to the end of the string.
    if (inWord)
        words++;

    return words;
}
Robert Paulson
How many words is "run-of-the-mill"? (just something to think about)
280Z28
@280Z28 - you bring up a good point. fyi even MS Word treats that as 1 word.
Robert Paulson
what about "one.two three;four"
codeulike
@codeulike Yes, another good point. As this was only meant to be an alternate, more culturally aware version of another poster's answer, I'll leave it as-is. In a real program I'd have unit tests to indicate exactly what I expected. You could say this is more of an approach to solving the problem for yourself. The benefits, as stated elsewhere, are that it works in a single pass and has minimal memory pressure.
Robert Paulson
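The "run-of-the-mill" and "one.two three;four" comments come down to what counts as a separator. One possible way to keep the single-pass approach while leaving that decision to the caller is to pass the separator test in as a delegate. This is a sketch only: the names ConfigurableWordCounter and CountWords and the isSeparator parameter are hypothetical and not part of any answer above.

```csharp
using System;

static class ConfigurableWordCounter
{
    // Counts maximal runs of characters for which isSeparator returns false.
    public static int CountWords(string text, Func<char, bool> isSeparator)
    {
        int words = 0;
        bool inWord = false;
        foreach (char c in text)
        {
            if (isSeparator(c))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                // First character of a new word.
                inWord = true;
                words++;
            }
        }
        return words;
    }
}
```

With char.IsWhiteSpace as the test, "run-of-the-mill" is 1 word and "one.two three;four" is 2. With c => char.IsWhiteSpace(c) || char.IsPunctuation(c), both strings count as 4 words. Neither is "right"; the unit tests mentioned above are where the expected behaviour gets pinned down.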