I'm using a naive approach to this problem: I'm putting the words in a linked list and just doing a linear search through it. But it's taking too much time on large files.

I was thinking of using a Binary Search Tree, but I don't know if it works well with strings. I've also heard of Skip Lists, but I haven't really learned them yet.

And also I have to use the C language...

+5  A: 

You can put all of the words into a trie and then count the number of words after you have processed the whole file.
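A minimal sketch of that approach (assuming words consist of ASCII letters only; all names here are illustrative, not from any particular library):

#include <ctype.h>
#include <stdlib.h>

#define ALPHABET 26

struct trie_node {
    struct trie_node *child[ALPHABET];
    int is_word;                         /* set when a word ends at this node */
};

static struct trie_node *node_new(void)
{
    return calloc(1, sizeof(struct trie_node));   /* error handling omitted */
}

/* Insert a word; returns 1 if it was not seen before, 0 otherwise. */
static int trie_insert(struct trie_node *root, const char *word)
{
    struct trie_node *n = root;

    for (; *word; ++word) {
        int i = tolower((unsigned char)*word) - 'a';
        if (i < 0 || i >= ALPHABET)
            continue;                    /* skip non-alphabetic characters */
        if (n->child[i] == NULL)
            n->child[i] = node_new();
        n = n->child[i];
    }
    if (n->is_word)
        return 0;
    n->is_word = 1;
    return 1;
}

Create the root with node_new(), call trie_insert() for every word read from the file, and sum the return values; the total is the number of unique words.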

James McNellis
Justin Ardini
DAWG and tries? Guys, are you serious? It's like comparing two numbers with quicksort :)
Nikita Rybak
@Nikita: There was no mention of the data size. And though I agree these suggestions are probably overkill for the OP, they could be quite useful for someone else who runs into the same situation.
Justin Ardini
@Nikita: In my opinion, this is a good, general-case solution. The time complexity is linear with respect to the number of words in the file. The implementation is straightforward. In addition, a trie data structure is easy to adapt to many other, related problems.
James McNellis
@James Of course, quicksort is also reasonably simple to implement, provides the same complexity and can be adapted to many different problems :)
Nikita Rybak
@James: "The implementation is straightforward." only if waste of memory is acceptable or one works only with latin alphabet. Otherwise, one soon discovers all *pleasures* of working with raw bits. I think a link to some usable trie implementation in C would be definitely helpful. At least as to not to send people reinvent wheels right away.
Dummy00001
+4  A: 

Binary Search Trees work fine for strings.

If you don't care about having the words in sorted order, you can just use a hash table.

caf
+1  A: 

I'm putting the words in a linked list and just doing a linear search through it.
If, to check whether word W is present, you go through the whole list, then it's surely slow: O(n^2) overall, where n is the size of the list.

The simplest way is probably using a hash table. It's easy to implement yourself (unlike some tree structures), and even C should have some libraries for that. You'll get O(n) complexity.

edit Some C hashtable implementations
http://en.wikipedia.org/wiki/Hash_table#Independent_packages
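For reference, a rough sketch of a fixed-size chained hash table for this task (the table size, hash function and names are only illustrative, not taken from any of the packages above):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096                    /* assumed fixed table size */

struct entry {
    char *word;
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];

static unsigned long hash(const char *s)  /* djb2-style string hash */
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns 1 if the word was newly added, 0 if it was already present. */
static int add_word(const char *word)
{
    unsigned long h = hash(word);
    struct entry *e;

    for (e = buckets[h]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return 0;

    e = malloc(sizeof *e);               /* error handling omitted */
    e->word = strdup(word);
    e->next = buckets[h];
    buckets[h] = e;
    return 1;
}

Summing the return values of add_word() over all words in the file gives the number of unique words, in expected O(n) time.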

Nikita Rybak
Actually I'm doing this counting to know a good size for my hash table.
Vandell
@Vandell You can increase the size of the hash table 'on the fly'. Start with, say, 32, and each time your hash table becomes overpopulated, switch to a table twice as big. It's a fairly common approach; among others, the official Java hash table implementation (HashMap) takes it.
Nikita Rybak
torak
A: 

If you're on a UNIX system, then you could use the bsearch() or hsearch() family of functions instead of a linear search.
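A rough sketch of the hsearch() route (POSIX <search.h>), counting a word only when it is first entered into the table; the word-reading loop and table size are just placeholders:

#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char word[128];
    size_t unique = 0;

    hcreate(10000);                      /* capacity guess; hsearch tables cannot grow */

    while (scanf("%127s", word) == 1) {
        ENTRY item, *found;

        item.key = word;
        found = hsearch(item, FIND);     /* already in the table? */
        if (found == NULL) {
            item.key = strdup(word);     /* the table stores the pointer, so copy */
            item.data = NULL;
            hsearch(item, ENTER);
            unique++;
        }
    }

    printf("%zu unique words\n", unique);
    hdestroy();
    return 0;
}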

Steve Emmerson
My list is not sorted, so I can't use binary search.
Vandell
Don't put the words into a linked-list -- put them into a structure that's then put into a hash table using hsearch(), instead. The structure should have a counter for the number of occurrences.
Steve Emmerson
+1  A: 

The first upgrade to your algorithm could be keeping the list sorted, so your linear search could be faster (you only search until you find an element greater than yours), but this is still a naive solution.

The best approaches are Binary Search Trees and, even better, a prefix tree (or trie, already mentioned in another answer).

In "The C Programming Language" From K&R you have the exact example of what you are looking for. The first example of "autoreferenced data structs" (6.5) is a binary search tree used for counting the ocurrences of every word in a string. (You don't need to count :P)

The structure is something like this:

struct tnode {                          /* tree node */
        char *word;                     /* the word itself */
        struct tnode *left;             /* left child: words that compare less */
        struct tnode *right;            /* right child: words that compare greater */
};

In the book you can see the whole example of what you want to do.
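A minimal sketch of the insertion routine in the same spirit as the K&R example, using the struct tnode above (without the occurrence count, and with error handling omitted):

#include <stdlib.h>
#include <string.h>

/* Add a word to the tree rooted at p; returns the (possibly new) subtree root. */
struct tnode *addtree(struct tnode *p, const char *w)
{
    int cond;

    if (p == NULL) {                     /* new word: create a node */
        p = malloc(sizeof *p);
        p->word = strdup(w);
        p->left = p->right = NULL;
    } else if ((cond = strcmp(w, p->word)) < 0) {
        p->left = addtree(p->left, w);
    } else if (cond > 0) {
        p->right = addtree(p->right, w);
    }
    /* cond == 0: the word is already in the tree, nothing to do */
    return p;
}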

Binary Search Trees work well with any type of data that can be ordered, and they will be better than a linear search in a list.

Sorry for my poor English, and correct me if I was wrong about something I've said; I'm very much a noob with C :p

EDIT: I can't add comments to other answers, but I have read a comment from the OP saying "The list isn't sorted so I can't use binary search". It is nonsense to use binary search on a linked list. Why? Binary search is efficient when access to a random element is fast, like in an array. In a doubly linked list, your worst-case access will take n/2 steps. However, you can put extra pointers into the list (giving access to key elements), but that is a bad solution.

Marco
@charco I'm not necessarily using a linked list, and he noticed that when suggesting a binary search. Everybody here knows that you need random access to have an efficient binary search.
Vandell
Sorry, I misunderstood. In that case, just take the first part of my answer and do a binary search. But I think that the binary tree is a better solution (since you will do basically the same).
Marco
And I'm using a binary tree, thank you very much!
Vandell
A: 

If you need something simple and easily available, then see man tsearch for a simple binary search tree. But this is a plain binary search tree, not a balanced one.
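A rough sketch using tfind()/tsearch() from <search.h>, counting a word only when it is not already in the tree (the reading loop and buffer size are placeholders):

#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
    return strcmp(a, b);
}

int main(void)
{
    void *root = NULL;                   /* tsearch manages the tree through this */
    size_t unique = 0;
    char word[128];

    while (scanf("%127s", word) == 1) {
        if (tfind(word, &root, cmp) == NULL) {   /* not seen yet */
            tsearch(strdup(word), &root, cmp);   /* store a copy of the word */
            unique++;
        }
    }

    printf("%zu unique words\n", unique);
    return 0;
}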

Depending on the number of unique words, a plain C array + realloc() + qsort() + bsearch() might be an option too. That's what I use when I need a no-frills, faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
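A sketch of the plain-array variant: collect every word, qsort() the array, then count how many adjacent entries differ (bsearch() can then be used on the sorted array for lookups). The reading loop and sizes are placeholders:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmpstr(const void *a, const void *b)
{
    return strcmp(*(char * const *)a, *(char * const *)b);
}

int main(void)
{
    char **words = NULL;
    size_t n = 0, cap = 0, i, unique = 0;
    char buf[128];

    while (scanf("%127s", buf) == 1) {
        if (n == cap) {                  /* grow the array as needed */
            cap = cap ? cap * 2 : 64;
            words = realloc(words, cap * sizeof *words);
        }
        words[n++] = strdup(buf);
    }

    qsort(words, n, sizeof *words, cmpstr);

    for (i = 0; i < n; i++)              /* count distinct adjacent entries */
        if (i == 0 || strcmp(words[i], words[i - 1]) != 0)
            unique++;

    printf("%zu unique words\n", unique);
    return 0;
}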

More advanced options are often platform-specific (e.g. GLib on Linux).

P.S. Another structure that is very easy to implement is a hash table. It is less efficient for strings, but it can very quickly be made blazing fast by throwing memory at the problem.

Dummy00001
+3  A: 

You're counting the number of unique words in the file?

Why don't you construct a simple hash table? This way, for each word in your list, add it into the hash table. Any duplicates will be discarded since they would already be in the hash table - and finally, you can just count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).

viksit
The counter only needs 1 bit, too - two states indicating "appears once" or "appears more than once".
caf