views:

336

answers:

4

I was asked this during an interview and apparently it's an easy question but it wasn't and still isn't obvious to me.

Given a string, count all the words in it. Doesn't matter if they are repeated. Just the total count like in a text files word count. Words are anything separated by a space and punctuation doesn't matter, as long as it's part of a word.

For example: A very, very, very, very, very big dog ate my homework!!!! ==> 11 words

My "algorithm" just goes through looking for spaces and incrementing a counter until I hit a null. Since i didn't get the job and was asked to leave after that I guess My solution wasn't good? Anyone have a more clever solution? Am I missing something?

+15  A: 

Assuming words are white space separated:

unsigned int countWordsInString(std::string const& str)
{
    std::stringstream stream(str);
    return std::distance(std::istream_iterator<std::string>(stream), std::istream_iterator<std::string>());
}

Note: There may be more than one space between words. Also this does not catch other white space characters like tab new line or carriage return. So counting spaces is not enough.

The stream input operator >> when used to read a string from a stream. Reads one white space separated word. So they were probably looking for you to use this to identify words.

std::stringstream  stream(str);
std::string        oneWord;

stream >> oneWord; // Reads one space separated word.

When can use this to count words in a string.

std::stringstream  stream(str);
std::string        oneWord;
unsigned int       count = 0;

while(stream >> oneWord) { ++count;}
// count now has the number of words in the string.

Getting complicated:
Streams can be treated just like any other container and there are iterators to loop through them std::istream_iterator. When you use the ++ operator on an istream_iterator it just read the next value from the stream using the operator >>. In this case we are reading std::string so it reads a space separated word.

std::stringstream  stream(str);
std::string        oneWord;
unsigned int       count = 0;

std::istream_iterator loop = std::istream_iterator<std::string>(stream);
std::istream_iterator end  = std::istream_iterator<std::string>();

for(;loop != end; ++count, ++loop) { *loop; }

Using std::distance just wraps all the above in a tidy package as it find the distance between two iterators by doing ++ on the first until we reach the second.

To avoid copying the string we can be sneaky:

unsigned int countWordsInString(std::string const& str)
{
    std::stringstream stream;

    // sneaky way to use the string as the buffer to avoid copy.
    stream.rdbuf()->pubsetbuf (str.c_str(), str.length() );
    return std::distance(std::istream_iterator<std::string>(stream), std::istream_iterator<std::string>());
}

Note: we still copy each word out of the original into a temporary. But the cost of that is minimal.

Martin York
Would +1 for cleverness but out of votes -- note however that this results in copying the underlying string at least twice as well as several memory allocations, plus several virtual function calls as a result of running on top of iostreams.
Billy ONeal
@ Billy ONeal: True. But I bet for a string under 1K it is indistinguishable in time over just one run. Plus it is so easy to read and understand that I am willing to pay this cost.
Martin York
@Billy ONeal: Added sneaky way to avoid the main string copy. Copying each word into a temporary string has minimal cost.
Martin York
*Easy to read* for whom? The author or the unknown maintainer? (It could be that a place of employ demands that folks know the C++ standard library, but I have yet to be employed by such a facility. :) ) (Note that I think this is a neat solution, but have been so far away from proper C++ for so long that it'd take me by surprise to run across it.)
dash-tom-bang
@dash-tom-bang: Err.. it's unreasonable to ask for a C++ programmer and expect them not to know C++. The standard library is a part of C++.
Billy ONeal
@dash-tom-bang: I hope it is easy for anybody to read. The whole point of a standard library is that everybody use rather than inventing their own. Thus even if you did not know about distance you could look it up in twenty seconds see it us using iterators deduce that we were iterating across a stream and bobs your uncle. Code deduced in 30 seconds.
Martin York
My "objection" is only to the implication that every company has the same needs and requirements as your own. My industry typically uses a very strict subset of C++, and it is surprising to find anyone who knows `<algorithm>` at all, much less in overview. "Standard library" in my industry means "probably useful to people with different requirements than ours."
dash-tom-bang
Note that `setbuf` with nonzero arguments is implementation-defined for all standard `streambuf` classes including `stringbuf`. It might maybe work with all SGI-derived iostreams, but… `strstream` is probably more portable.
Potatoswatter
Martin, thank for the very detailed answer, but I don't think this was what the interviewer had in mind. It may very well be much better than the other solutions but only the c++ rockstars are gone grok this, which i can assure you my interviewer was not (and neither am I for that matter)
eviljack
@eviljack: Different interviews are different.. When I was shopping around in Manhattan some 5 years ago, in some places the basic trivia questions included things like "what is Koenig lookup?" and "when are streambuf iterators better than stream iterators?".
Cubbi
This is the kind of question I ask to find out if the interview does know the STL. A good candidate will do this (or at least use streams to split the input into words (as that is very common)). The best ones will use std::distanace() the good ones will use a loop the ones we don't hire will be the ones that parse a word by themselves.
Martin York
+4  A: 

A less clever, more obvious-to-all-of-the-programmers-on-your-team method of doing it.

#include <cctype>

int CountWords(const char* str)
{
   if (str == NULL)
      return error_condition;  // let the requirements define this...

   bool inSpaces = true;
   int numWords = 0;

   while (*str != NULL)
   {
      if (std::isspace(*str))
      {
         inSpaces = true;
      }
      else if (inSpaces)
      {
         numWords++;
         inSpaces = false;
      }

      ++str;
   }

   return numWords;
}
dash-tom-bang
Its relatively standard to use the stream operator >> to get words. I don't see how this is more obvious as it takes a while to read all this code and understand it.
Martin York
@dash-tom-bang: My bad. I don't see why immediately why it works but testing it on codepad seems to work. I think it's unintuitive though.
Billy ONeal
@dash-tom-bang: I deleted my answer because I found some cases where it gives wrong answers. That said, I still think the design was more intuitive than this one.
Billy ONeal
It may be standard to use >> to get words, but for those of us who never deal with text files the standard is irrelevant. The point of my opening statement is that this code requires only that the user understands very basic parts of the language.
dash-tom-bang
@Billy intuitive is also a personal value. :) Regardless, we should all strive to avoid doing (virtually) the same check in multiple places, because it is a vector for errors. Also- choosing names that impart meaning rather than state implementation details is helpful to future generations.
dash-tom-bang
This is the answer I'd like to see as an interviewer. When asking someone to implement a trivial algorithm in an interview, the interviewer is usually trying to see if the person can write a piece of low-level code sanely without introducing unnecessary overhead, complexity, or obfuscation. They are not attempting to engage in a dick-measuring contest over obscure library features or challenge the interviewee to a game of code golf.
blucz
This is similar to what I had. Guess we would both be no hires for that company. Lol.
eviljack
Lol, +1 for not engaging in dick measuring contests!
eviljack
That should be inSpaces=false towards the end. Did u test it?
eviljack
@eviljack yes thanks. I renamed the variable after posting to be more appropriate, and missed one spot. sry.
dash-tom-bang
@eviljack yeah I suspect that the requirements of my job and of Martin's are very different.
dash-tom-bang
+1  A: 

This can be done without manually looking at every character or copying the string.

#include <boost/iterator/transform_iterator.hpp>
#include <cctype>

boost::transform_iterator
    < int (*)(int), std::string::const_iterator, bool const& >
    pen( str.begin(), std::isalnum ), end( str.end(), std::isalnum );

size_t word_cnt = 0;

while ( pen != end ) {
    word_cnt += * pen;
    pen = std::mismatch( pen+1, end, pen ).first;
}

return word_cnt;

I took the liberty of using isalnum instead of isspace.

This is not something I would do at a job interview. (It's not like it compiled the first time.)

Or, for all the Boost haters ;v)

if ( str.empty() ) return 0;

size_t word_cnt = std::isalnum( * str.begin() );

for ( std::string::const_iterator pen = str.begin(); ++ pen != str.end(); ) {
    word_cnt += std::isalnum( pen[ 0 ] ) && ! std::isalnum( pen[ -1 ] );
}

return word_cnt;
Potatoswatter
this works even better with `std::string::const_iterator`
Cubbi
Also, everything except `size_t word_cnt…` and `return…` could go inside a `for` loop.
Potatoswatter
This answer is impressive, but basically unreadable, and requires an unwieldy 3rd party library notorious for exploding build times. If someone attempted this on an interview I'd probably pass on them.
blucz
Yeah the isalnum vs isspace is a question open due to the ambiguous nature of the OP's post; I parsed it as "punctuation is not a space so it counts as a word character."
dash-tom-bang
@blucz: With a few comments, it wouldn't be so bad, since the job of the `transform_iterator` is pretty straightforward. Using `mismatch` to find state transitions is generally useful. Some C++ programmers are also picky about whether lighter-weight Boost components are allowed…
Potatoswatter
A: 

Another boost based solution that may work (untested):

vector<string> result;
split(result, "aaaa bbbb cccc", is_any_of(" \t\n\v\f\r"), token_compress_on);

More information can be found in the Boost String Algorithms Library

Christopher Hunt
While it works, it's not exactly what I'd be looking for in an interview.
Billy ONeal
Why not? Unless the interviewer specifically asks for an implementation that doesn't use a large amount of temporary storage, or that it shouldn't use boost, then I think this is a very acceptable answer. It's certainly the most readable of all of the offered solutions and I think is a good example of idiomatic C++.
the_mandrill
I would also probably go on to ask them how they might implement it without using a vector of strings and without boost. A good candidate would be able to offer both kinds of solution.
the_mandrill
"While it works, it's not exactly what I'd be looking for in an interview." - you mean you'd be looking for someone then who wants to always writes things the long way. ;-) It would be useful to me for you to elaborate on your comment.
Christopher Hunt