views: 71
answers: 2
Hi, I need to do a school project in C (I really don't know C++ that well).

I need a data structure to index every word in about 34k documents (that's a lot of words), and I need to do some ranking on the words. I already did this project about 2 years ago (I took a break from school and came back this year) and I used a hash table of binary trees, but I got a low grade because my project took about 2 hours to index all the words. I need something a little faster... any suggestions?

Thanks, Roberto

+1  A: 

If you have the option, I'd strongly recommend using a database engine (MSSQL, MySQL, etc.), as those are exactly the sorts of datasets and operations they're written for. Best not to reinvent the wheel.

Otherwise, why use a btree at all? From what you've described (and I realise we're probably not getting the full story...), a straight-up hash table with the word as the key and its rank/count of occurrences should do the job; see the sketch below.
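
As a minimal sketch of that idea in C (not your assignment's exact interface; TABLE_SIZE, the djb2 hash, and the add_word name are all just illustrative choices):

    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 1048576   /* illustrative; pick roughly 2x the expected distinct words */

    struct entry {
        char *word;              /* key: the word itself */
        unsigned long count;     /* value: number of occurrences seen so far */
        struct entry *next;      /* chaining for hash collisions */
    };

    static struct entry *table[TABLE_SIZE];

    /* djb2 string hash (Dan Bernstein) */
    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Increment the count for a word, inserting it the first time it is seen. */
    void add_word(const char *word)
    {
        unsigned long i = hash(word) % TABLE_SIZE;
        struct entry *e;

        for (e = table[i]; e != NULL; e = e->next) {
            if (strcmp(e->word, word) == 0) {
                e->count++;
                return;
            }
        }

        e = malloc(sizeof *e);
        e->word = strdup(word);  /* strdup is POSIX; copy manually on strict ISO C setups */
        e->count = 1;
        e->next = table[i];
        table[i] = e;
    }

With the table kept much larger than the number of distinct words, a typical lookup is one hash, one memory access, and one strcmp, which is what makes this fast compared with walking a tree per word.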

arcwhite
Indeed, I need to reinvent the wheel :P Because it's a school project, whose objective is to learn how data structures work in a real database, for example...
Roberto
A: 

bogofilter (the spam filter) has to keep word counts. It uses dbm as a backend, since it needs persistent storage of the word -> count map. You might want to look at the code for inspiration. Or not, since you need to implement the db part of it for the school project, not so much the spam filter part.

Minimize the amount of pointer chasing you have to do. Data-dependent memory loads are slow, especially on a large working set where you will get cache misses. So make sure your hash table is big enough that you don't need a big tree in each bucket (a rough sizing sketch follows). And maybe check that your binary trees are dense, not degenerate linked lists, when you do get more than one value in a hash bucket.
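
A rough sketch of that sizing idea in C; the target load factor and the next_pow2/choose_table_size names are only illustrative assumptions, not a prescribed API:

    #include <stddef.h>

    /* Round up to the next power of two so hash % size can become a cheap bit mask. */
    static size_t next_pow2(size_t n)
    {
        size_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    /* Aim for a load factor around 0.5, so most buckets hold zero or one entry
     * and almost no lookup has to chase a chain or a per-bucket tree. */
    size_t choose_table_size(size_t expected_distinct_words)
    {
        return next_pow2(expected_distinct_words * 2);
    }

For example, if you guessed around 500,000 distinct words across the 34k documents (a made-up figure), choose_table_size(500000) returns 1048576 buckets.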

If it's slow, profile it, and see if your problem is one slow function, or if it's cache misses, or if it's branch mispredictions.

Peter Cordes