tags:

views:

206

answers:

4

I've got to modify a C program and I need to include a set of unsigned integer sets. That is, I have millions of sets of integers (each of these integer sets contains between 3 and 100 integers), and I need to store these in some structure, lets call it the directory, that can in logarithmic time tell me whether a given integer set already exists in the directory. The only operations that need to be defined on the directory is lookup and insert.

This would be easy in languages with built-in support for useful data structures, but I'm a foreigner to C and looking around on Google did (surprisingly) not answer my question satisfactorily. This project looks about right:

http://uthash.sourceforge.net/

but I would need to come up with my own hash key generator.

This is a standard, simple problem, so I hope there is a standard and simple solution.

A: 

Implement a simple hash table yourself. It will make you a better programmer, when you know how to implement one on your own.

http://en.wikipedia.org/wiki/Hash_table

Tuomas Pelkonen
It may be true that it would make me a better programmer to implement this myself. However, it's not much of an answer. If I simply wanted to become a better programmer, there are probably better exercises I could spend my time on. Furthermore, it's unlikely that I will implement solution that performs optimally, and it's likely that a high-performing solution will take me a lot of time to implement. I find it strange that there is no library like C++'s STL that would give me a simple solution, and that instead I need to re-invent (or re-implement) the wheel.
conradlee
A: 

EDIT: sorry, I started answering as it's C++ and not C. Yes then you should find your hash function and code it by yourself.. since you already know the average dimension of a set it's not so difficult, just choose a good hash function! But you'll need to codify a whole set in a single number if you want to check if a directory is already there.

You can try by iteratively hashing the single numbers of the set:

int hashcode = initvalue
for (int i = 0; i < 0; ++i)
  hashcode = calc_code(hashcode, number_set[i], i);

in a way that the hashfunction depends on its previous value, the current number and the current index.

What about STL sets?

#include <set>

int nums[6] = {1,6,34,2,67,41};
set<int> numbers;

for( int i = 0; i < 6; ++i ) numbers.insert(nums[i]);

for( set<int>::const_iterator iter = numbers.begin(); iter != numbers.end(); ++iter )
  cout << *iter << ' ';

Using this data structure you can easily store all your sets, but you need also a way to check if a set is already included in the directory. It's not clear: do you want to know if a set that have all the SAME elements exists already in the directory?

You can do it manually by checking all the elements but since you have millions of them you should find a way to hash the elements of the set in an unique number and use a map of sets..

Jack
The OP asked about a C program, and the STL is purely C++.
David Thornley
STL is for C++, this is question is tagged as "C"
Remo.D
yes, sorry, I edited it :)just woke up.. still a little bit blurry
Jack
A: 

If I understand you correctly, you want to represent a set of sets of integer which I don't think is particularly trivial.

The first point is to represent a set of integers. The simplest way would be use a variable size array like this:

typedef struct { 
  int size;
  int elems[1];
} intset;

than you can create a new set (with a fixed number of elements) with

intset *newset(int size) 
{ 
  intset *set;
  set = malloc(sizeof(intset) + sizeof(int)*(size-1));
  if (set) set->size = size;
  return set;
}

and store the elements with set->elems[0]=i1; ....

Another option would be to use bit arrays but the implementation would depend of the nature of the integers to store (e.g. are they within a fixed range? Do they usually appear in groups in a set?).

Once you have your set of integers you will need a compare function (to determine whether two sets have the same elements). If you opted for an array to represent a set and you keep that array sorted, is quite simple to check if two sets are identical; if it's a bitmap, it will depend on how you implemented it.

Now, for the set of sets you can choose a (sorted) vector, that you might need to resize from time to time while inserting elements, or an hash table. In the latter case you'll need to write an hash function for your sets of integer (possibly using existing functions!).

As I said, it seems not trivial to me, I'm not surprised google didn't help.

It's not extremely complicated, though, you just have to take some decisions before proceeding.

Remo.D
I'm surprised to hear that it's not trivial, because in other languages (even the similar C++ with its STL), it would be trivial.The integer values are unsigned and in some fixed range (as in the range is known at runtime, not compile time), in most cases between 0 and 10 million, although in some cases between 0 and up to 100 million.If I use use a hash table, do any hash functions come to mind? Would zoborist hashing be appropriate here?
conradlee
+1  A: 

It depends on what you are going to do with the data. But maybe tsearch does already what you want. You could also build a sorted array for each set and look up the values with bsearch, although the performance may suffer during the insertion.

EDIT: If you are looking for an (external) library, you'll find a comparision of some C and C++ hash table implementation here. The author of the article has written a generic header implementation called khash. So you're compiled binary don't have any additional dependencies.

quinmars