tags:

views:

208

answers:

4

In one of my programs for school, I use the following function to count the frequency of identifiers in a string, separated by newlines and #:

Input:

dog
cat
mouse
#
rabbit
snake
#

Function:

//assume I have the proper includes, and am using namespace std
vector< pair<string,int> > getFreqcounts(string input) {
    vector<string> items = splitString(input,"\n");
    map<string,int> counts;

    for (int i=0; i<items.size(); i++) {
        if (items[i] == "#") continue;
        counts[items[i]] = 0;
    }
    for (int i=0; i<items.size(); i++) {
        if (items[i] == "#") continue;
        counts[items[i]]++;
    }

    return vector< pair<string,int> > (counts.begin(),counts.end());
}

I would like to at the very least

  • remove the double for loop
  • find a better way to get a vector< pair<string,int> >

Any ideas?

BTW, this is NOT homework. The real homework will use this function, but this is purely out of my own curiosity and desire to have "better" code.

+5  A: 

My understanding of std::map is such that the entire first loop can simply be eliminated. The first time you try to access a node that does not exist the map will default-create it for you, setting the initial count to zero (the default-ctor behavior of a builtin type.) That should be all the changes you need to make to your code, and the behavior should be the same.

Update: On a side note in the code you have provided counts will be sorted according to operator< defined for std::string (the key type for your map), which will sort the map nodes lexicographically. There's no need to pump the results through a vector and sort the vector - the map is handling this for you automatically.

fbrereto
Indeed it does. Thanks!
Austin Hyde
+2  A: 

You can get rid of the first for loop by simply deleting it. It accomplishes nothing useful. When/if the subscript into the map creates a new item, that item will have the chosen key, and your associated int will be initialized to zero automatically.

Personally, I'd probably do things a bit differently, using a stringstream instead of your SplitString(). I'm hesitant about posting code, but I guess I'll trust you...

typedef vector<pair<string, int> > count_vec;

count_vec GetFreqCounts(string const &input) { 
    istringstream in(input);
    string line;
    map<string, int> counts;

    while (getline(in, line))
        if (line != "#")
            ++counts[line];
    return count_vec(counts.begin(), counts.end());
}

Edit: I honestly didn't pay a whole lot of attention to efficiency as I was writing this, but I think Steve Jessop's comment on it is pretty accurate. As long as the input is small, it won't make any real difference. If the input is really big, the fact that this only uses an extra copy of one word at a time could save enough memory to be meaningful.

The solution Steve gave in his reply looks pretty nice too though. Since it also processes words as they're produced, I'd expect it to have characteristics similar to the code above. If you can break the string into words faster than stringstream does, it'll undoubtedly be faster. Given the number of virtual functions that get in the way with iostreams, there's a pretty good chance of that -- but unless you're dealing with a lot of text there's not much chance of it making a significant difference. Of course, exactly what qualifies as significant is open to question. To put it in perspective, I ran some similar code across a word list I had handy. Using code pretty close to what's above, it processes text at a little over 10 megabytes a second.

Jerry Coffin
Austin Hyde
It probably will be more efficient, yes. The reason is that your splitStrings constructs a honking great vector of strings, then goes back to the beginning and walks over it again. Jerry uses each string immediately, as soon as it's constructed. So it uses less memory, and for significant amounts of data Jerry's version should provoke fewer cache misses. Against that, `stringstream::getline()` may or may not be faster than whatever you're doing in splitString.
Steve Jessop
Also, taking input by reference will prevent a copy of it, but istringstream then copies it anyway. So modifying splitString to take a const reference might be fastest of all. Especially if you also modify splitString to take a functor specifying what to do with each line it finds, rather than returning the lines. You can then specify this functor to increment the count. Whether you should care sort of depends how big your input is likely to be - copying 2k of data is probably nothing. Copying 2GB kind of matters.
Steve Jessop
+1  A: 

In a comment to Jerry's answer I mentioned using a functor. Here's the sort of thing I mean (untested code):

struct StringCounter {
    std::map<std::string, int> counts;
    void operator()(const std::string &s) {
        ++counts[s];
    }
};

template <typename Output>
void splitString(const string &input, const string &separator, Output &out) {
    // do whatever you currently do to get each string, call it "s"...
    out(s);
    // lather, rinse, repeat
}

vector< pair<string,int> > getFreqcounts(const string &input) {
    StringCounter sc;
    splitString(input,"\n",sc);
    return vector< pair<string,int> > (sc.counts.begin(),sc.counts.end());
}
Steve Jessop
A: 

If you want to keep the current logic but contain all of it in a single loop, you can use find(), like so:

map<string,int> counts;
loop on curr ..
map<string,int>::iterator it = counts.find(curr);
if (it != counts.end())
  ++it->second;
else
  counts[curr] = 1;

But that's not what I would do. There's no need to explicitly initialize the map entry in the first time as it will default to zero anyway, so you can use:

map<string,int> counts;
loop on curr ..
  if (..
    ++map[curr];

Another thing i would do is use std::getline() with '\n' and '#' being the delimiters.

rmn