ansaurus

Question

What is a surefire way to get the word count of a string in C#?

Answer 1

+2 A:

This seems to work for me:

var input = "This is a  test";
var count = input.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Length;

Luke Quinane 2009-09-02 03:58:31

Splitting has a huge overhead associated with it. A loop through the characters will be the fastest path. OTOH, Regex will likely be slower than Split, since Split returns an array of plain strings, which are less complex than the Match objects a Regex produces.

richardtallent 2009-09-02 05:15:21

I call bullsh*t on richard's statement. It turns out Split is actually faster in some cases and only marginally slower than the foreach solution.

Sam Saffron 2009-09-02 08:20:51

Answer 2

+1 A:

Try string.Split:

string sentence = "This     is a sentence     with  some spaces.";
string[] words = sentence.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
int wordCount = words.Length;

bobbymcr 2009-09-02 03:59:12
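
A minimal sketch of a variation on the Split approach, for input that may contain tabs or line breaks as well as spaces. It relies on the documented behavior that a null separator array makes string.Split treat every white-space character as a delimiter. The class and method names here are hypothetical, chosen only for illustration.

```csharp
using System;

static class SplitCounter
{
    // Hypothetical helper: counts white-space-separated tokens.
    public static int CountWords(string input)
    {
        // The cast is needed because a bare null is ambiguous between
        // the char[] and string[] overloads of Split.
        // A null separator means "split on any white-space character".
        return input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

For example, SplitCounter.CountWords("This is a\ttest\r\nhere") would count five words, where splitting on ' ' alone would count three.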

Answer 3

+2 A:

While solutions based on Split are short to write, they might get expensive, as all the string objects need to be created and then thrown away. I would expect that an explicit algorithm such as

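  static int CountWords(string s)
  {
    int words = 0;
    bool inword = false;  // true while the scan is inside a word
    for (int i = 0; i < s.Length; i++) {
      switch (s[i]) {
      case ' ': case '\t': case '\r': case '\n':
          // Whitespace ends the current word, if there is one.
          if (inword) words++;
          inword = false;
          break;
      default:
          inword = true;
          break;
      }
    }
    // Count a final word that runs to the end of the string.
    if (inword) words++;
    return words;
  }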

is more efficient (plus it can also consider additional whitespace characters).

Martin v. Löwis 2009-09-02 04:14:49

If you're going that route, I would suggest a while loop, an index variable and string.IndexOfAny(char[], int).

bobbymcr 2009-09-02 04:19:33
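
A rough sketch of what the while loop plus string.IndexOfAny(char[], int) suggestion might look like. This is not the commenter's actual code (that was posted on pastebin); the class name, the Separators array, and the choice of separator characters are assumptions made for illustration.

```csharp
using System;

static class IndexOfAnyCounter
{
    // Assumed separator set, matching the switch-based version in the answer.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    public static int CountWords(string s)
    {
        int words = 0;
        int start = 0;
        while (start < s.Length)
        {
            // Find the next separator at or after the current position.
            int end = s.IndexOfAny(Separators, start);
            if (end < 0) end = s.Length;  // no separator left: the word runs to the end
            if (end > start) words++;     // a non-empty span before the separator is a word
            start = end + 1;              // continue just past the separator
        }
        return words;
    }
}
```

Consecutive separators produce empty spans (end == start), which are skipped rather than counted.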

@bobbymcr I don't see how that would help. Presumably IndexOfAny is just linearly searching.

Robert Paulson 2009-09-02 05:02:51

On my system, the IndexOfAny approach is indeed faster. Linear scan + switch = between 0.014 and 0.015 ms; while + IndexOfAny = between 0.011 and 0.012 ms. Code here: http://pastebin.ca/1551255 This hardly matters, but I wanted to be sure! :)

bobbymcr 2009-09-02 07:49:24

Whoops, there was a bug in the previous code... corrected: http://pastebin.ca/1551268. Perf results are the same.

bobbymcr 2009-09-02 08:02:10

@bobbymcr: I can't reproduce your results. The switch version runs in 0.008 ms and the IndexOfAny version in 0.016 ms, using VS 2008. If I make the string much longer (e.g. doubling it 10 times), the IndexOfAny version consistently takes twice as long.

Martin v. Löwis 2009-09-02 16:12:06
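
Timings this small are easy to get wrong, so here is a minimal sketch of how such comparisons might be measured with System.Diagnostics.Stopwatch. It is not the harness either commenter used; the Timing class and AverageMs method are hypothetical names, and the results will vary with machine, runtime, and input.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Hypothetical helper: average milliseconds per call over `iterations` runs.
    public static double AverageMs(Func<string, int> counter, string text, int iterations)
    {
        counter(text);  // warm-up call so JIT compilation is not part of the measurement
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter(text);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

A call such as Timing.AverageMs(s => s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length, text, 10000) could then be compared against the same call for a loop-based counter on the same text.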

Answer 4

+4 A:

An alternate version of @Martin v. Löwis's answer that uses foreach and char.IsWhiteSpace(), which should be more correct when dealing with other cultures.

int CountWithForeach(string para)
{
    bool inWord = false;
    int words = 0;
    foreach (char c in para)
    {
        if (char.IsWhiteSpace(c))
        {
            if (inWord)
                words++;
            inWord = false;
            continue;
        }
        inWord = true;
    }
    if (inWord)
        words++;

    return words;
}

Robert Paulson 2009-09-02 05:05:13

How many words is "run-of-the-mill"? (just something to think about)

280Z28 2009-09-02 09:05:20

@280Z28 - you bring up a good point. FYI, even MS Word treats that as one word.

Robert Paulson 2009-09-02 21:01:52

What about "one.two three;four"?

codeulike 2010-10-07 21:28:50

@codeulike Yes, another good point. As this was only meant to be an alternate, more culturally aware version of another poster's answer, I'll leave it as-is. In a real program I'd have unit tests to indicate exactly what I expected. You could say this is more of an approach to solving the problem for yourself. The benefits, as stated elsewhere, are that it works in a single pass and has minimal memory pressure.

Robert Paulson 2010-10-08 04:26:54
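
To make the edge cases raised in these comments concrete, here is one possible sketch of a single-pass counter that treats punctuation as a separator while keeping hyphenated words together. The rules are an assumption, not something the thread settles on: letters and digits form words, and '-' and '\'' may appear inside a word without ending it. The class name is hypothetical.

```csharp
using System;

static class PunctuationAwareCounter
{
    // Assumed rule set: a word starts at a letter or digit; '-' and '\''
    // neither start nor end a word; any other character ends it.
    public static int CountWords(string text)
    {
        int words = 0;
        bool inWord = false;
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                if (!inWord) words++;  // count each word once, at its first character
                inWord = true;
            }
            else if (c != '-' && c != '\'')
            {
                inWord = false;        // whitespace or other punctuation ends the word
            }
            // '-' and '\'' leave inWord unchanged
        }
        return words;
    }
}
```

Under these assumed rules, "run-of-the-mill" counts as one word and "one.two three;four" counts as four. Different requirements would need different rules, which is where the unit tests mentioned above come in.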
