views:

463

answers:

8

Hi all,

I have the following already sorted data:

AAA
AAA
TCG
TTT
TTT
TTT

I want to count the occurrence of each string yielding

AAA 2
TCG 1
TTT 3

I know I can do that with "uniq -c", but here I need to do extra processing on the overall C++ code I have.

I am stuck with this construct (modified according to 'pgras' suggestion)

#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
using namespace std;


int main  ( int arg_count, char *arg_vec[] ) {
    if (arg_count !=2 ) {
        cerr << "expected one argument" << endl;
        return EXIT_FAILURE;
    }

    string line;
    ifstream myfile (arg_vec[1]);


    if (myfile.is_open())
    {
        int count;
        string lastTag = "";

        while (getline(myfile,line) )
        {
            stringstream ss(line);
            string Tag;

            ss >> Tag; // read first column
            //cout << Tag << endl; 

            if (Tag != lastTag) {
               lastTag = Tag;
               count = 0;
            }
            else {
                count++;
            }

             cout << lastTag << " " << count << endl;
        }
        cout << lastTag << " " << count << endl;
        myfile.close();

    }
    else {cout << "Unable to open file";}
    return 0;
}

It prints this instead:

AAA 0
AAA 1
TCT 0
TTT 0
TTT 1
TTT 2
TTT 2
+1  A: 

Your code looks slightly broken syntactically (the ifstream, ...), but the overall algorithm I think is sound. Read lines, and increment a counter every time the line is the same as the one before. There might be some boundary conditions to consider (what if the input is only one line?), but you'll catch those during testing.

unwind
And remember to start with -1 for the initial item, or else the question is slightly erroneous. ;)That said, the other answers until now are not as efficient.
Arafangion
+1  A: 

The use of the stringstream just to get the tag seems a bit of overkill - I'd probably use string::substr. That aside, what do you think is wrong with your code? What do you want to improve?

Edit: Next thing, we will be getting downvoted for breathing...

anon
+1, not sure why this was downvoted...
j_random_hacker
+6  A: 

If you just want to print it out, your algorithm is ok. If you want to pass it to another function, you can use for example STL map.

map<string, int> dict;

while(getline(myfile,line)) {
          string Tag;
          stringstream ss(line);
          ss >> Tag;
          if (dict.count(Tag) == 0) 
               dict[Tag] = 1;
           else
               dict[Tag]++;
}
vartec
You don't need the extra `if` inside the loop. The [] operator creates a default-constructed item if none exists.
Konrad Rudolph
+6  A: 

You have to reset counter when tag is different from lastTag, and increment if it's the same... When the tag is different you can handle the previous tag with it's associated count value (before you reset count)...

pgras
@pgras: I modified, but maybe I still dont' get you.
neversaint
Hello see Svante answer, it's exactly what I meant...
pgras
+4  A: 

Use something like this:

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
#include <iterator>


std::ostream& operator << ( std::ostream& out, const std::pair< std::string, size_t >& rhs )
{
    out << rhs.first << ", " << rhs.second;
    return out;
}

int main() 
{
    std::ifstream inp( "mysorted_data.txt" );
    std::string str;
    std::map < std::string, size_t > words_count;
    while ( inp >> str )
    {
     words_count[str]++;
    }

    std::copy( 
     words_count.begin(), 
     words_count.end(), 
     std::ostream_iterator< std::pair< std::string, size_t > >( std::cout, "\n" ) );

    return 0;
}
bb
+4  A: 

Assuming that your data indeed consists of DNA strings of length 3 (or more general length N where N is quite small), you can make this very efficient by using a q-gram table which is a specialized hash table with a table size of 4N and the following hashing function:

// Disregard error codes.
int char2dna_lookup[] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x0  – 0xF
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10 – 0x1F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x20 – 0x2F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x30 – 0x3F
    0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, // A    – P
    0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // Q    – 0x5F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x60 – 0x6F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x70 – 0x7F
}

unsigned int hash(string const& dna) {
    unsigned int ret = 0;

    for (unsigned int i = 0; i < dna.length(); ++i)
        ret = ret * 4 + char2dna_lookup[dna[i]];

    return ret;
}

You can now index your array very efficiently.

ifstream ifs("data.txt");
string line;

if (not ifs >> line)
    exit(1);

unsigned* frequencies = new unsigned int[line.length()];

frequencies[hash(line)] = 1;

while (ifs >> line)
    ++frequencies[hash(line)];

// Print the frequencies …

delete[] frequencies;

Alternatively, use a library such as SeqAn for such tasks.

Konrad Rudolph
Notice, the code is untested. There might be errors in the lookup table (or elsewhere).
Konrad Rudolph
could you post an article showing which technice is this?
Edison Gustavo Muenz
+2  A: 

I think that all you have to do is replace this

        if (Tag != lastTag) {
           lastTag = Tag;
           count = 0;
        }
        else {
            count++;
        }

        cout << lastTag << " " << count << endl;

with this:

        if (Tag != lastTag) {
            if (lastTag != "") {  // don't print initial empty tag
                cout << lastTag << " " << count << endl;
            }
            lastTag = Tag;
            count = 1; // count current
          } else {
            count++;
        }
Svante
+1  A: 
#include <map>
#include <string>
#include <algorithm>
#include <iterator>
#include <iostream>

class Counter
{   private: std::map<std::string,int>&   m_count;
    public:  Counter(std::map<std::string,int>& data) :m_count(data){}
        void operator()(std::string const& word)
        {
            m_count[word]++;
        }};
class Printer
{   private: std::ostream& m_out;
    public:  Printer(std::ostream& out) :m_out(out) {}
        void operator()(std::map<std::string,int>::value_type const& data)
        {
            m_out << data.first << " = " << data.second << "\n";
        }};

int main()
{
    std::map<std::string,int>       count;

    for_each(std::istream_iterator<std::string>(std::cin),
             std::istream_iterator<std::string>(),
             Counter(count)
            );

    for_each(count.begin(),count.end(),
             Printer(std::cout)
            );
}
Martin York