I am not sure how to go about this. Right now I am counting the spaces to get the word count of my string, but if there is a double space the word count will be inaccurate. Is there a better way to do this?
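To make the problem concrete, here is a minimal sketch of the space-counting approach. The question does not show its code, so the "spaces + 1" formula and the names SpaceCountDemo and CountBySpaces are assumptions for illustration only.

```csharp
using System;
using System.Linq;

public static class SpaceCountDemo
{
    // Assumed naive approach: word count = number of spaces + 1.
    public static int CountBySpaces(string s)
    {
        return s.Count(c => c == ' ') + 1;
    }

    static void Main()
    {
        Console.WriteLine(CountBySpaces("This is a test"));  // single spaces: prints 4
        Console.WriteLine(CountBySpaces("This  is a test")); // one double space: prints 5
    }
}
```

The second call over-counts because every extra space is treated as a word boundary.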
+2
A:
This seems to work for me:
var input = "This is a test";
var count = input.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Length;
Luke Quinane
2009-09-02 03:58:31
Splitting has a huge overhead associated with it. A loop through the characters will be the fastest path. OTOH, Regex will likely be slower than Split, since Split returns an array of a simpler type (string) than the Match objects a regex produces.
richardtallent
2009-09-02 05:15:21
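For reference, the regex variant mentioned in this comment could look roughly like the sketch below. Nobody in the thread posted one, so the pattern and the names RegexWordCount and CountWords are assumptions, and the sketch makes no claim about speed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexWordCount
{
    public static int CountWords(string s)
    {
        // \S+ matches each maximal run of non-whitespace characters,
        // so consecutive spaces never produce an empty "word".
        return Regex.Matches(s, @"\S+").Count;
    }

    static void Main()
    {
        Console.WriteLine(CountWords("This  is a   test")); // prints 4
    }
}
```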
I call bullsh*t on richard's statement: it turns out Split is actually faster in some cases and only marginally slower than the foreach solution.
Sam Saffron
2009-09-02 08:20:51
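Anyone who wants to check these claims on their own machine could use something like the rough timing sketch below. It is not the benchmark any commenter ran; the test string, the iteration count, and all names (WordCountTiming, CountWithSplit, CountWithLoop, Time) are assumptions, and results will vary with runtime and input.

```csharp
using System;
using System.Diagnostics;

public static class WordCountTiming
{
    public static int CountWithSplit(string s)
    {
        return s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    public static int CountWithLoop(string s)
    {
        int words = 0;
        bool inWord = false;
        foreach (char c in s)
        {
            if (c == ' ')
            {
                if (inWord) words++; // a word just ended
                inWord = false;
            }
            else
            {
                inWord = true;
            }
        }
        return inWord ? words + 1 : words; // count a trailing word
    }

    // Times `iterations` calls of `counter` after one warm-up call.
    public static TimeSpan Time(Func<string, int> counter, string input, int iterations)
    {
        counter(input); // warm-up so JIT compilation is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) counter(input);
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        string input = "This is a  sentence with   some spaces.";
        const int iterations = 100000; // arbitrary choice
        Console.WriteLine("Split: " + Time(CountWithSplit, input, iterations));
        Console.WriteLine("Loop:  " + Time(CountWithLoop, input, iterations));
    }
}
```

Build in Release mode and try several input lengths before drawing conclusions, since the relative cost can shift with string size.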
+1
A:
Try string.Split:
string sentence = "This is a sentence with some spaces.";
string[] words = sentence.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
int wordCount = words.Length;
bobbymcr
2009-09-02 03:59:12
+2
A:
While solutions based on Split are short to write, they might get expensive, as all the string objects need to be created and then thrown away. I would expect that an explicit algorithm such as
static int CountWords(string s)
{
    int words = 0;
    bool inword = false;
    for (int i = 0; i < s.Length; i++) {
        switch (s[i]) {
            case ' ': case '\t': case '\r': case '\n':
                if (inword) words++;
                inword = false;
                break;
            default:
                inword = true;
                break;
        }
    }
    if (inword) words++;
    return words;
}
is more efficient (plus it can also consider additional whitespace characters).
Martin v. Löwis
2009-09-02 04:14:49
If you're going that route, I would suggest a while loop, an index variable and string.IndexOfAny(char[], int).
bobbymcr
2009-09-02 04:19:33
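A sketch of what that suggestion might look like follows. This is not bobbymcr's actual code (the pastebin links are not reproduced here); the separator set is borrowed from the answer above, and the names IndexOfAnyWordCount and CountWords are assumptions.

```csharp
using System;

public static class IndexOfAnyWordCount
{
    static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    public static int CountWords(string s)
    {
        int words = 0;
        int start = 0;
        while (start < s.Length)
        {
            // Find the next separator at or after `start`.
            int next = s.IndexOfAny(Separators, start);
            if (next < 0)
            {
                words++; // no more separators: the remainder is one last word
                break;
            }
            if (next > start) words++; // non-empty run before the separator
            start = next + 1;          // continue just past the separator
        }
        return words;
    }

    static void Main()
    {
        Console.WriteLine(CountWords("This  is a test")); // prints 4
    }
}
```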
@bobbymcr I don't see how that would help. Presumably IndexOfAny is just linearly searching.
Robert Paulson
2009-09-02 05:02:51
On my system, the IndexOfAny approach is indeed faster. Linear scan + switch = between 0.014 and 0.015 ms; while + IndexOfAny = between 0.011 and 0.012 ms. Code here: http://pastebin.ca/1551255 This hardly matters, but I wanted to be sure! :)
bobbymcr
2009-09-02 07:49:24
Whoops, there was a bug in the previous code... corrected: http://pastebin.ca/1551268. Perf results are the same.
bobbymcr
2009-09-02 08:02:10
@bobbymcr: I can't reproduce your results. The case version runs 0.008ms, the indexof version 0.016ms, using VS 2008. If I make the string much longer (e.g. doubling it 10 times), the indexof version consistently takes twice as long.
Martin v. Löwis
2009-09-02 16:12:06
+4
A:
An alternate version of @Martin v. Löwis's answer, which uses a foreach and char.IsWhiteSpace() and should be more correct when dealing with other cultures.
int CountWithForeach(string para)
{
    bool inWord = false;
    int words = 0;
    foreach (char c in para)
    {
        if (char.IsWhiteSpace(c))
        {
            if (inWord)
                words++;
            inWord = false;
            continue;
        }
        inWord = true;
    }
    if (inWord)
        words++;
    return words;
}
Robert Paulson
2009-09-02 05:05:13
@280Z28 - you bring up a good point. FYI, even MS Word treats that as 1 word.
Robert Paulson
2009-09-02 21:01:52
@codeulike Yes, another good point. As this was only meant to be an alternate, more culturally aware version of another poster's answer, I'll leave it as-is. In a real program I'd have unit tests to indicate exactly what I expected. You could say this is more of an approach to solving the problem for yourself. The benefits, as stated elsewhere, are that it works in a single pass and has minimal memory pressure.
Robert Paulson
2010-10-08 04:26:54