ansaurus

Question

Algorithm for Counting Sorted Strings (Homebrew "uniq -c")

Answer 1

+1 A:

Your code looks slightly broken syntactically (the ifstream, ...), but the overall algorithm I think is sound. Read lines, and increment a counter every time the line is the same as the one before. There might be some boundary conditions to consider (what if the input is only one line?), but you'll catch those during testing.

unwind 2009-03-11 09:41:13

And remember to start with -1 for the initial item, or else the question is slightly erroneous. ;)That said, the other answers until now are not as efficient.

Arafangion 2009-03-11 10:16:27

Answer 2

+1 A:

The use of the stringstream just to get the tag seems a bit of overkill - I'd probably use string::substr. That aside, what do you think is wrong with your code? What do you want to improve?

Edit: Next thing, we will be getting downvoted for breathing...

anon 2009-03-11 09:43:58

+1, not sure why this was downvoted...

j_random_hacker 2009-03-11 09:59:15

Answer 3

+6 A:

If you just want to print it out, your algorithm is ok. If you want to pass it to another function, you can use for example STL map.

map<string, int> dict;

while(getline(myfile,line)) {
          string Tag;
          stringstream ss(line);
          ss >> Tag;
          if (dict.count(Tag) == 0) 
               dict[Tag] = 1;
           else
               dict[Tag]++;
}

vartec 2009-03-11 09:48:58

You don't need the extra `if` inside the loop. The [] operator creates a default-constructed item if none exists.

Konrad Rudolph 2009-03-11 21:19:57

Answer 4

+6 A:

You have to reset counter when tag is different from lastTag, and increment if it's the same... When the tag is different you can handle the previous tag with it's associated count value (before you reset count)...

pgras 2009-03-11 09:49:33

@pgras: I modified, but maybe I still dont' get you.

neversaint 2009-03-11 10:28:39

Hello see Svante answer, it's exactly what I meant...

pgras 2009-03-12 07:30:16

Answer 5

+4 A:

Use something like this:

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
#include <iterator>


std::ostream& operator << ( std::ostream& out, const std::pair< std::string, size_t >& rhs )
{
    out << rhs.first << ", " << rhs.second;
    return out;
}

int main() 
{
    std::ifstream inp( "mysorted_data.txt" );
    std::string str;
    std::map < std::string, size_t > words_count;
    while ( inp >> str )
    {
     words_count[str]++;
    }

    std::copy( 
     words_count.begin(), 
     words_count.end(), 
     std::ostream_iterator< std::pair< std::string, size_t > >( std::cout, "\n" ) );

    return 0;
}

bb 2009-03-11 10:25:48

Answer 6

+4 A:

Assuming that your data indeed consists of DNA strings of length 3 (or more general length N where N is quite small), you can make this very efficient by using a q-gram table which is a specialized hash table with a table size of 4^N and the following hashing function:

// Disregard error codes.
int char2dna_lookup[] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x0  – 0xF
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10 – 0x1F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x20 – 0x2F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x30 – 0x3F
    0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, // A    – P
    0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // Q    – 0x5F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x60 – 0x6F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x70 – 0x7F
}

unsigned int hash(string const& dna) {
    unsigned int ret = 0;

    for (unsigned int i = 0; i < dna.length(); ++i)
        ret = ret * 4 + char2dna_lookup[dna[i]];

    return ret;
}

You can now index your array very efficiently.

ifstream ifs("data.txt");
string line;

if (not ifs >> line)
    exit(1);

unsigned* frequencies = new unsigned int[line.length()];

frequencies[hash(line)] = 1;

while (ifs >> line)
    ++frequencies[hash(line)];

// Print the frequencies …

delete[] frequencies;

Alternatively, use a library such as SeqAn for such tasks.

Konrad Rudolph 2009-03-11 10:33:50

Notice, the code is untested. There might be errors in the lookup table (or elsewhere).

Konrad Rudolph 2009-03-11 10:34:43

could you post an article showing which technice is this?

Edison Gustavo Muenz 2009-03-11 13:04:29

Answer 7

+2 A:

I think that all you have to do is replace this

        if (Tag != lastTag) {
           lastTag = Tag;
           count = 0;
        }
        else {
            count++;
        }

        cout << lastTag << " " << count << endl;

with this:

        if (Tag != lastTag) {
            if (lastTag != "") {  // don't print initial empty tag
                cout << lastTag << " " << count << endl;
            }
            lastTag = Tag;
            count = 1; // count current
          } else {
            count++;
        }

Svante 2009-03-11 16:23:12

Answer 8

+1 A:

#include <map>
#include <string>
#include <algorithm>
#include <iterator>
#include <iostream>

class Counter
{   private: std::map<std::string,int>&   m_count;
    public:  Counter(std::map<std::string,int>& data) :m_count(data){}
        void operator()(std::string const& word)
        {
            m_count[word]++;
        }};
class Printer
{   private: std::ostream& m_out;
    public:  Printer(std::ostream& out) :m_out(out) {}
        void operator()(std::map<std::string,int>::value_type const& data)
        {
            m_out << data.first << " = " << data.second << "\n";
        }};

int main()
{
    std::map<std::string,int>       count;

    for_each(std::istream_iterator<std::string>(std::cin),
             std::istream_iterator<std::string>(),
             Counter(count)
            );

    for_each(count.begin(),count.end(),
             Printer(std::cout)
            );
}

Martin York 2009-03-11 18:52:25

ansaurus

tags:

views:

answers:

Algorithm for Counting Sorted Strings (Homebrew "uniq -c")

related questions