ansaurus

Question

C++ function to count all the words in a string

Answer 1

+15 A:

Assuming words are white space separated:

unsigned int countWordsInString(std::string const& str)
{
    std::stringstream stream(str);
    return std::distance(std::istream_iterator<std::string>(stream), std::istream_iterator<std::string>());
}

Note: There may be more than one space between words. Also this does not catch other white space characters like tab new line or carriage return. So counting spaces is not enough.

The stream input operator >> when used to read a string from a stream. Reads one white space separated word. So they were probably looking for you to use this to identify words.

std::stringstream  stream(str);
std::string        oneWord;

stream >> oneWord; // Reads one space separated word.

When can use this to count words in a string.

std::stringstream  stream(str);
std::string        oneWord;
unsigned int       count = 0;

while(stream >> oneWord) { ++count;}
// count now has the number of words in the string.

Getting complicated:
Streams can be treated just like any other container and there are iterators to loop through them std::istream_iterator. When you use the ++ operator on an istream_iterator it just read the next value from the stream using the operator >>. In this case we are reading std::string so it reads a space separated word.

std::stringstream  stream(str);
std::string        oneWord;
unsigned int       count = 0;

std::istream_iterator loop = std::istream_iterator<std::string>(stream);
std::istream_iterator end  = std::istream_iterator<std::string>();

for(;loop != end; ++count, ++loop) { *loop; }

Using std::distance just wraps all the above in a tidy package as it find the distance between two iterators by doing ++ on the first until we reach the second.

To avoid copying the string we can be sneaky:

unsigned int countWordsInString(std::string const& str)
{
    std::stringstream stream;

    // sneaky way to use the string as the buffer to avoid copy.
    stream.rdbuf()->pubsetbuf (str.c_str(), str.length() );
    return std::distance(std::istream_iterator<std::string>(stream), std::istream_iterator<std::string>());
}

Note: we still copy each word out of the original into a temporary. But the cost of that is minimal.

Martin York 2010-09-08 22:07:37

Would +1 for cleverness but out of votes -- note however that this results in copying the underlying string at least twice as well as several memory allocations, plus several virtual function calls as a result of running on top of iostreams.

Billy ONeal 2010-09-08 22:29:05

@ Billy ONeal: True. But I bet for a string under 1K it is indistinguishable in time over just one run. Plus it is so easy to read and understand that I am willing to pay this cost.

Martin York 2010-09-08 22:32:21

@Billy ONeal: Added sneaky way to avoid the main string copy. Copying each word into a temporary string has minimal cost.

Martin York 2010-09-08 22:38:27

*Easy to read* for whom? The author or the unknown maintainer? (It could be that a place of employ demands that folks know the C++ standard library, but I have yet to be employed by such a facility. :) ) (Note that I think this is a neat solution, but have been so far away from proper C++ for so long that it'd take me by surprise to run across it.)

dash-tom-bang 2010-09-08 22:45:58

@dash-tom-bang: Err.. it's unreasonable to ask for a C++ programmer and expect them not to know C++. The standard library is a part of C++.

Billy ONeal 2010-09-08 22:50:30

@dash-tom-bang: I hope it is easy for anybody to read. The whole point of a standard library is that everybody use rather than inventing their own. Thus even if you did not know about distance you could look it up in twenty seconds see it us using iterators deduce that we were iterating across a stream and bobs your uncle. Code deduced in 30 seconds.

Martin York 2010-09-08 22:50:41

My "objection" is only to the implication that every company has the same needs and requirements as your own. My industry typically uses a very strict subset of C++, and it is surprising to find anyone who knows `<algorithm>` at all, much less in overview. "Standard library" in my industry means "probably useful to people with different requirements than ours."

dash-tom-bang 2010-09-08 22:57:46

Note that `setbuf` with nonzero arguments is implementation-defined for all standard `streambuf` classes including `stringbuf`. It might maybe work with all SGI-derived iostreams, but… `strstream` is probably more portable.

Potatoswatter 2010-09-09 03:20:50

Martin, thank for the very detailed answer, but I don't think this was what the interviewer had in mind. It may very well be much better than the other solutions but only the c++ rockstars are gone grok this, which i can assure you my interviewer was not (and neither am I for that matter)

eviljack 2010-09-09 12:29:55

@eviljack: Different interviews are different.. When I was shopping around in Manhattan some 5 years ago, in some places the basic trivia questions included things like "what is Koenig lookup?" and "when are streambuf iterators better than stream iterators?".

Cubbi 2010-09-09 13:01:47

This is the kind of question I ask to find out if the interview does know the STL. A good candidate will do this (or at least use streams to split the input into words (as that is very common)). The best ones will use std::distanace() the good ones will use a loop the ones we don't hire will be the ones that parse a word by themselves.

Martin York 2010-09-09 15:15:08

Answer 2

+4 A:

A less clever, more obvious-to-all-of-the-programmers-on-your-team method of doing it.

#include <cctype>

int CountWords(const char* str)
{
   if (str == NULL)
      return error_condition;  // let the requirements define this...

   bool inSpaces = true;
   int numWords = 0;

   while (*str != NULL)
   {
      if (std::isspace(*str))
      {
         inSpaces = true;
      }
      else if (inSpaces)
      {
         numWords++;
         inSpaces = false;
      }

      ++str;
   }

   return numWords;
}

dash-tom-bang 2010-09-08 22:30:58

Its relatively standard to use the stream operator >> to get words. I don't see how this is more obvious as it takes a while to read all this code and understand it.

Martin York 2010-09-08 22:46:14

@dash-tom-bang: My bad. I don't see why immediately why it works but testing it on codepad seems to work. I think it's unintuitive though.

Billy ONeal 2010-09-08 22:46:14

@dash-tom-bang: I deleted my answer because I found some cases where it gives wrong answers. That said, I still think the design was more intuitive than this one.

Billy ONeal 2010-09-08 22:48:11

It may be standard to use >> to get words, but for those of us who never deal with text files the standard is irrelevant. The point of my opening statement is that this code requires only that the user understands very basic parts of the language.

dash-tom-bang 2010-09-08 22:52:01

@Billy intuitive is also a personal value. :) Regardless, we should all strive to avoid doing (virtually) the same check in multiple places, because it is a vector for errors. Also- choosing names that impart meaning rather than state implementation details is helpful to future generations.

dash-tom-bang 2010-09-08 22:53:23

This is the answer I'd like to see as an interviewer. When asking someone to implement a trivial algorithm in an interview, the interviewer is usually trying to see if the person can write a piece of low-level code sanely without introducing unnecessary overhead, complexity, or obfuscation. They are not attempting to engage in a dick-measuring contest over obscure library features or challenge the interviewee to a game of code golf.

blucz 2010-09-08 23:59:19

This is similar to what I had. Guess we would both be no hires for that company. Lol.

eviljack 2010-09-09 02:31:33

Lol, +1 for not engaging in dick measuring contests!

eviljack 2010-09-09 11:48:47

That should be inSpaces=false towards the end. Did u test it?

eviljack 2010-09-09 12:34:52

@eviljack yes thanks. I renamed the variable after posting to be more appropriate, and missed one spot. sry.

dash-tom-bang 2010-09-09 19:17:00

@eviljack yeah I suspect that the requirements of my job and of Martin's are very different.

dash-tom-bang 2010-09-09 19:23:31

Answer 3

+1 A:

This can be done without manually looking at every character or copying the string.

#include <boost/iterator/transform_iterator.hpp>
#include <cctype>

boost::transform_iterator
    < int (*)(int), std::string::const_iterator, bool const& >
    pen( str.begin(), std::isalnum ), end( str.end(), std::isalnum );

size_t word_cnt = 0;

while ( pen != end ) {
    word_cnt += * pen;
    pen = std::mismatch( pen+1, end, pen ).first;
}

return word_cnt;

I took the liberty of using isalnum instead of isspace.

This is not something I would do at a job interview. (It's not like it compiled the first time.)

Or, for all the Boost haters ;v)

if ( str.empty() ) return 0;

size_t word_cnt = std::isalnum( * str.begin() );

for ( std::string::const_iterator pen = str.begin(); ++ pen != str.end(); ) {
    word_cnt += std::isalnum( pen[ 0 ] ) && ! std::isalnum( pen[ -1 ] );
}

return word_cnt;

Potatoswatter 2010-09-08 23:17:52

this works even better with `std::string::const_iterator`

Cubbi 2010-09-08 23:32:59

Also, everything except `size_t word_cnt…` and `return…` could go inside a `for` loop.

Potatoswatter 2010-09-08 23:41:19

This answer is impressive, but basically unreadable, and requires an unwieldy 3rd party library notorious for exploding build times. If someone attempted this on an interview I'd probably pass on them.

blucz 2010-09-08 23:47:20

Yeah the isalnum vs isspace is a question open due to the ambiguous nature of the OP's post; I parsed it as "punctuation is not a space so it counts as a word character."

dash-tom-bang 2010-09-08 23:52:05

@blucz: With a few comments, it wouldn't be so bad, since the job of the `transform_iterator` is pretty straightforward. Using `mismatch` to find state transitions is generally useful. Some C++ programmers are also picky about whether lighter-weight Boost components are allowed…

Potatoswatter 2010-09-08 23:55:20

Answer 4

A:

Another boost based solution that may work (untested):

vector<string> result;
split(result, "aaaa bbbb cccc", is_any_of(" \t\n\v\f\r"), token_compress_on);

More information can be found in the Boost String Algorithms Library

Christopher Hunt 2010-09-09 02:01:36

While it works, it's not exactly what I'd be looking for in an interview.

Billy ONeal 2010-09-09 02:02:35

Why not? Unless the interviewer specifically asks for an implementation that doesn't use a large amount of temporary storage, or that it shouldn't use boost, then I think this is a very acceptable answer. It's certainly the most readable of all of the offered solutions and I think is a good example of idiomatic C++.

the_mandrill 2010-09-09 19:45:09

I would also probably go on to ask them how they might implement it without using a vector of strings and without boost. A good candidate would be able to offer both kinds of solution.

the_mandrill 2010-09-09 19:46:29

"While it works, it's not exactly what I'd be looking for in an interview." - you mean you'd be looking for someone then who wants to always writes things the long way. ;-) It would be useful to me for you to elaborate on your comment.

Christopher Hunt 2010-09-09 23:53:48

ansaurus

tags:

views:

answers:

C++ function to count all the words in a string

related questions