views:

4442

answers:

10

How could I go about keeping track of the number of times a word appears in a textfile? I would like to do this for every word.

For example, if the input is something like:

"the man said hi to the boy."

Each of "man said hi to boy" would have an occurrence of 1.

"the" would have an occurence of 2.

I was thinking of keeping a dictionary with word/occurrence pairs but I'm not sure how to implement this in C. A link to any similar or related problems with a solution would be great.


EDIT: To avoid rolling my own hash table I decided to learn how to use GLib. Along the way I found an excellent tutorial which walks through a similar problem. http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html

I am awestruck by the number of different approaches, and in particular the simplicity and elegance of the Ruby implementation.

+1  A: 

You could use a hash table and have every entry in the hash table point to a structure containing the word and number of times it has been found so far.

Jared
Would it be possible to have a collision where two different words hash to the same entry? Would I have to do some checking at the entry or does a perfect hash function exist? I'm a little rusty but I will do my research. Thanks
vinc456
That's the usual approach. You would need to guard against collisions, often by making each hash bucket a linked list of word-count structures. +1
dmckee
Wasn't this at +1 when I last saw it? Why would someone downvote a correct answer? :P +1 from me.
ShreevatsaR
+4  A: 

Yes, a dictionary with word-occurence pairs would work fine, and the usual way to implement such a dictionary would be to use a hash table (or, sometimes, a binary search tree).

You could also use a trie (or its compressed version, "Patricia trie"/Radix trie) whose complexity is asymptotically optimal for this problem, although I suspect that in practice it might be slower than a (good) hash table implementation.

[I really think whether hash tables or tries are better depends on the distribution of words in your input -- e.g. a hash table will need to store each word in its hash bucket (to guard against collisions), while if you have a lot of words with common prefixes, in a trie those common prefixes are shared and need to be stored only once each, but there is still the overhead of all the pointers... if you do happen to try both, I'm curious to know how they compare.]

ShreevatsaR
A: 

WARNING untested code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct LLNode
{
    struct LLNode* Next;
    char*          Word;
    int            Count;
};
typedef struct LLNode LLNode;

void PushWord(LLNode** list, const char* word)
{
    LLNode* node;
    for (node = *list; node != NULL; node = node->Next)
    {
        if (!strcmp(node->Word, word))
        {
            node->Count++;
            return;
        }
    }

    // not found: prepend a new node
    node = malloc(sizeof(LLNode));
    node->Word = malloc(strlen(word) + 1);
    strcpy(node->Word, word);
    node->Count = 1;
    node->Next = *list;
    *list = node;
}

void GetCounts(LLNode* list)
{
    LLNode* node;
    for (node = list; node != NULL; node = node->Next)
        printf("Word: %s, Count: %i\n", node->Word, node->Count);
}

void PushWords(LLNode** list, const char* words)
{
    unsigned int len = strlen(words);
    char buff[len + 1]; // to be sure we have no buffer overruns. May consume too much memory for your application though.
    int index = 0;
    for (unsigned int i = 0; i < len; i++)
    {
        char ch = words[i];
        if (ch == ' ' && index > 0)
        {
            buff[index] = '\0';
            PushWord(list, buff);
            index = 0;
        }
        else if (ch != ' ')
        {
            buff[index++] = ch;
        }
    }

    if (index > 0) // flush the last word
    {
        buff[index] = '\0';
        PushWord(list, buff);
    }
}

int main()
{
    LLNode* list = NULL;
    PushWords(&list, "Hello world this is a hello world test bla");
    GetCounts(list);
    // release our memory here
    while (list)
    {
        LLNode* next = list->Next;
        free(list->Word);
        free(list);
        list = next;
    }
    return 0;
}

I wrote that just now, so it probably won't work, but that is the general idea.

Another rough draft, this time in C++ (note: std::map lookups are O(log n)):

#include <iostream>
#include <string>
#include <map>
#include <cstring>

using namespace std;

typedef map<string, int> CountMap;

void PushWords(CountMap& counts, const char* words)
{
    unsigned int len = strlen(words);
    string str;
    for (unsigned int i = 0; i < len; i++)
    {
        char ch = words[i];
        if (ch == ' ')
        {
            if (!str.empty())
            {
                counts[str]++;
                str.clear(); // start the next word fresh
            }
        }
        else
        {
            str += ch;
        }
    }

    if (!str.empty()) // flush the last word
        counts[str]++;
}

void PrintCount(CountMap& counts)
{
    CountMap::iterator iter = counts.begin(), end = counts.end();
    for (; iter != end; ++iter)
        cout << iter->first << " : " << iter->second << "\n";
}


int main()
{
    CountMap counts;
    PushWords(counts, "Hello world this is a hello world test bla");
    PrintCount(counts);
}
nlaq
I will need some time to study your code. By the way thanks for your help. You can code faster than I can type!
vinc456
it would be good to use tokenization ...
FL4SOF
+4  A: 

Just for the curious, here is a simple Ruby solution of the word count problem. It should be basically the same algorithm in C, just with a lot more code.

h = Hash.new(0)
File.read("filename.txt").split.each do |w|
  h[w] += 1
end
p h
martinus
+2  A: 

Does this count?

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
    char buffer[2048];
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    snprintf(buffer, sizeof(buffer), "tr -cs '[a-z][A-Z]' '[\\n*]' < %s |"
                                     " sort | uniq -c | sort -n", argv[1]);
    return(system(buffer));
}

It basically wraps the canonical Unix pipeline for counting words in a small C program.

The 'tr' command translates anything that isn't an alphabetic character into a newline and squeezes out duplicates. The first 'sort' groups all occurrences of each word together. The 'uniq -c' counts the number of consecutive appearances of each word, printing the word and its count. The second 'sort' puts them in order of increasing repeats. You might need to dink with options to 'tr'; it isn't the most stable command from system to system, and it manages to routinely make me do manual bashing. On Solaris 10 using /usr/bin/tr, the code above produces (on its own source):

   1
   1 A
   1 EXIT
   1 FAILURE
   1 Usage
   1 Z
   1 a
   1 c
   1 cs
   1 exit
   1 file
   1 fprintf
   1 if
   1 main
   1 return
   1 sizeof
   1 snprintf
   1 stderr
   1 stdio
   1 stdlib
   1 system
   1 tr
   1 uniq
   1 z
   2 argc
   2 char
   2 h
   2 include
   2 int
   2 s
   2 sort
   3 argv
   3 n
   4 buffer
Jonathan Leffler
This works fine. I was not aware of the 'tr' command. Thank you for sharing and your wonderful explanation :-)
vinc456
A: 

C implementation courtesy of here

vinc456
+1  A: 

For individual words, there's no need to write a program at all unless this is part of something larger:

sed -e 's/[[:space:]]/\n/g' &lt; file.txt | grep -cx WORD
A: 
#include <iostream>
#include <fstream>
#include <cstdlib>

using namespace std;

struct stdt
{
       char name[20] ;
       int id ;

}; //std

int main()
{
      stdt boy ;
      int a = 0 ;
      ofstream TextFile ;
      cout << "Begin File Creation \n" ;
      TextFile.open("F:\\C++ Book Chapter Program\\Ch  7\\File.txt" );
      if ( !TextFile)
      {
           cerr << "Error 100 Opening File.DAT" ;
           exit(100);
      }//end if
      while ( a < 3 )
      {
            cout << "\nEnter Name : " ;
            cin  >> boy.name;
            cout << "\nEnter ID : " ;
            cin  >> boy.id ;
            TextFile.write( (char*) &boy , sizeof (boy) ) ; // write after reading so all three records are saved
            a++;
      }//end while

      TextFile.close();
      cout << "\nEnd File Creation" ;

      ifstream TextFile1 ;
      TextFile1.open("F:\\C++ Book Chapter Program\\Ch  7\\File.txt" );
      while ( TextFile1.read( (char*) &boy , sizeof (boy) ) )
      {
            cout << "\nName : " << boy.name;
            cout << "\nID : " << boy.id ;

      }// end While

      cin.get();
      return 0 ;
}//end main
A: 

in Perl:

my %wordcount = ();
while(<>){map {$wordcount{$_}++} (split /\s+/)}
print "$_ = $wordcount{$_}\n" foreach sort keys %wordcount;

and in Perl Golf (just for fun):

my%w;                       
map{$w{$_}++}split/\s+/while(<>); 
print"$_=$w{$_}\n"foreach keys%w;
dsm
A: 

vinc456, were you ultimately able to write this program? If yes, could you post the solution for us to see how it works?

hangman1060
I think I tweaked Bo Majewski's code to suit my needs but I can no longer find my source and his site seems to have disappeared :( There are some other solutions on this page so maybe give those a try.
vinc456