views:

133

answers:

5

Hi,

A few days ago I posted this question and everyone suggested me to use void*, which I did. I think some of them also pointed a few things that I would need to take care of but I'm not sure what exactly were they. However, I'm having a few problems with this...

I'm not going to post all my code where cause it's quite large, instead, I'll post the things that I think are important and hopefully are enough for you to help me out.

My hash table structure is like this:

typedef void * HashKey;
typedef void * HashValue;

typedef struct sHashItem {
    HashKey key;
    HashValue value;

    char status;
} HashItem;

typedef struct sHashTable {
    HashItem *items;

    int count;
    float load;
    int size;

    Bool (*compare)(HashKey, HashKey);
    unsigned (*hash)(void *);
} HashTable;

The signature for my insert function goes like this:

Bool hashInsert(HashTable * const table, HashKey key, HashValue value);

And somewhere inside that function, when I find a free bucket in the hash table, I do this:

table->items[index].key = key;
table->items[index].value = value;
table->items[index].status = USED;
table->load = ++table->count / (float)table->size;

All this presents a few problems:

1) As you can see above I'm simply setting each key/value pair of the free bucket to the same pointer passed as the key/value hashInsert function arguments. This presents a problem as you may have already noticed... For instance, doing something like this:

char str[50];
scanf("%s%*c", str);
hashInsert(t1, (HashKey)str, (HashValue)5);
scanf("%s%*c", str);
hashInsert(t1, (HashKey)str, (HashValue)3);

And if the input is "KeyA" and then "KeyB", both will have "KeyB" as the buckets keys. The same thing applies to the value and not just the key since they are basically the same type because I want to have my code fully modular, for any data type.

How could I solve this?

My first though is to use strdup(str) and pass that to the hashInsert function. That would solve the problem. And since this was handled in the main code, I could easily use malloc() too for any other data type I need to pass as the value (the key will probably always be a string or an int).

But this solution presents another problem...

2) How should I free this allocated memory? Sure, it was allocated by the "main programmer" and not the "hash table module programmer" so, the "main programmer" should free it in the main code, right? However, that doesn't look much like modular code to me.

My code also has a hashDestroy function, to free all the allocated memory. But how can I use this function to free everything? I can't just iterate over every key/value and use free() on them cause maybe some of them weren't malloc'd by any programmer in the first place and I don't need to free them.

How can I find out which ones my hashDestroy must free and which ones it shouldn't?

3) To finish, I guess I can also throw this issue into the mix... In point one, my suggestion was to use strdup() or malloc to "fix" that specific problem (while introducing another) but that also don't look very modular to me. This memory allocation should be done in the hash table module code and not in the main code by the "main programmer".

How do you suggest I solve this? I mean, the data types can be anything and while the use of strdup() helps a lot, it only works for strings. What if I need to allocate memory for some specific structure or just an int?

Sorry for the big post but I think these questions are all related and I need some help figuring them out since my C knowledge is not that extreme. I only recently learned about void* so...

+1  A: 

There are two general solutions for dealing with collisions in a hashtable:

  1. Use the next free bucket instead.
  2. A bucket stores a linked list so multiple items can be stored in the same bucket.

With either of those, the issue of when to free what never arises, since all kinds of data are allocated either by the hash table or by the hash table's client. If you're still curious, the short answer to that dilemma is to use smart pointers.

outis
I'm not sure you understood the problem, this has nothing to do with collisions.
Nazgulled
If you're using pointers as keys, the problem is collisions (altering the pointed-to value doesn't give you a new key). If you wish to use the pointed-to value as your key, you must overcome the problem that pointers have no inherent way of recording the size of the pointed-to value.
outis
The collisions were a side effect that. If I fix my problems, I'll fix 'those' collisions.
Nazgulled
Picking referers (pointers) or referents (pointed-to values) as keys isn't a problem, it's a design choice. The problem is you're mixing the two. Once you decide between them, the only real problem in both cases is dealing with collisions. Memory management isn't an issue with pointer keys as it is with referents, but it be solved either with smart pointers or by always copying keys, so the hashtable owns the keys. For referents, you have the additional problem of figuring out the length of the referent, both for comparison and copy.
outis
A: 

To implement a hash table, we need a set of buckets. And since multiple elements can hash to the same bucket, each bucket needs a linked list.

Does

HashItem *items;

perform the second requirement above?

From your explanation, its not clear if it does.

For an excellent example, see K&R section 6.6. link where name = HashKey and defn = HashValue. alt text

ArunSaha
Once again, the problem has nothing to do with collisions.
Nazgulled
+1  A: 

Hmmm, from what I see in your example the problem isn't hashtable collisions (though you seem to have this problem as well), it is how to manage the memory of the items stored in the table. I think the standard way of doing this sort of thing IS to force the user of the data structure (the hashtable) to do the work of allocating the space for all the stuff that is going to be put in the table. The hashtable should only have to worry about the pointers. Suppose you do do an allocation then copy in the data structure: how would the user know how to deallocate the memory when the item is removed from the hastable?

zdav
agree. hashtable should only deal with pointers to objects, not objects itself
fazo
Actually not, collisions are not the issue at all, you people are misunderstanding. About the other thing you pointed out, could you please try to explain it again? I didn't quite get it what you meant...
Nazgulled
The point is: the hashtable code is completely separate from your application code. Because of this, the hashtable is not responsible for any memory except its own internal array (note: this is just an array of pointers, no actual data is stored, just pointers to data.)
zdav
+2  A: 

Wow: this is going to take some answering in full. However, one of the key things you're going to need is the size of whatever it is you are processing - it is fine to use a void pointer, but you need to know how big the object whose address you're receiving is.

[...] everyone suggested me to use void*, which I did. [...]

My hash table structure is like this:

typedef void * HashKey;
typedef void * HashValue;

typedef struct sHashItem {
    HashKey key;
    HashValue value;
    char status;
} HashItem;

typedef struct sHashTable {
    HashItem *items;
    int count;
    float load;
    int size;

    Bool (*compare)(HashKey, HashKey);
    unsigned (*hash)(void *);
} HashTable;

You are likely to need a size_t key_sz; and a size_t val_sz; member in HashItem. Your hash function pointer will need to know how big the key to be hashed is.

I'm in two minds about what the HashKey should be. It depends in part on how you are using this stuff. It looks like you want:

  • Given this key value of my choosing,
  • Store/return this data that is associated with it.

In which case, you probably also need to store the hash number somewhere in the HashItem; that is the value returned by your hashing function - apparently an unsigned integer. I'm not sure what the signature on the compare function (function pointer) should be; I'm suspicious that it should either take a pair of HashKey-and-size values, or possibly a pair of HashItem pointers.

The signature for my insert function goes like this:

Bool hashInsert(HashTable * const table, HashKey key, HashValue value);

And somewhere inside that function, when I find a free bucket in the hash table, I do this:

table->items[index].key = key;
table->items[index].value = value;
table->items[index].status = USED;
table->load = ++table->count / (float)table->size;

All this presents a few problems:

1) As you can see above I'm simply setting each key/value pair of the free bucket to the same pointer passed as the key/value hashInsert function arguments. This presents a problem as you may have already noticed... For instance, doing something like this:

char str[50];
scanf("%s%*c", str);
hashInsert(t1, (HashKey)str, (HashValue)5);
scanf("%s%*c", str);
hashInsert(t1, (HashKey)str, (HashValue)3);

The key to using void * is to pass addresses around. Casting should be unnecessary in C. You also need to pass the size of things. Hence:

Bool hashInsert(HashTable * const table, HashKey key, size_t key_sz,
                HashValue value, size_t val_sz);

char str[50];
scanf("%s%*c", str);
int value = 5;
hashInsert(t1, str, strlen(str)+1, &value, sizeof(value));

Internally, you will copy the data - not using 'strdup()' since you don't know that there aren't interior NUL '\0' bytes in it.

And if the input is "KeyA" and then "KeyB", both will have "KeyB" as the buckets keys. The same thing applies to the value and not just the key since they are basically the same type because I want to have my code fully modular, for any data type.

How could I solve this?

You have to define who owns what, and whether (and how) the container copies data. In C++, the containers make a copy of whatever they are storing.

My first thought is to use strdup(str) and pass that to the hashInsert function. That would solve the problem. And since this was handled in the main code, I could easily use malloc() too for any other data type I need to pass as the value (the key will probably always be a string or an int).

You can't use 'strdup()' because in general, neither the values nor the keys are strings. If they are always strings, why are you using 'void *' instead of 'char *'?

You can decide to copy the value and the key - as long as you know the sizes.

But this solution presents another problem...

2) How should I free this allocated memory? Sure, it was allocated by the "main programmer" and not the "hash table module programmer" so, the "main programmer" should free it in the main code, right? However, that doesn't look much like modular code to me.

My code also has a hashDestroy function, to free all the allocated memory. But how can I use this function to free everything? I can't just iterate over every key/value and use free() on them cause maybe some of them weren't malloc'd by any programmer in the first place and I don't need to free them.

How can I find out which ones my hashDestroy must free and which ones it shouldn't?

You can't. You have to define a policy and only if that policy allows you to do the destruction should you do it. If you copy everything, you have an easy time. If you copy nothing, you have a different easy time (arguably easier), but your consumers have a hell of a time because they need a structure to keep track of what they need to release - perhaps a hashed list...

3) To finish, I guess I can also throw this issue into the mix... In point one, my suggestion was to use strdup() or malloc to "fix" that specific problem (while introducing another) but that also don't look very modular to me. This memory allocation should be done in the hash table module code and not in the main code by the "main programmer".

Yes...that's basically my recommendation.

How do you suggest I solve this? I mean, the data types can be anything and while the use of strdup() helps a lot, it only works for strings. What if I need to allocate memory for some specific structure or just an int?

Note that copying only does shallow copies. If the structures you're copying contain pointers, then the duplication code will only copy the pointer and not the pointed at data.

So, a general solution will require some sort of copy function. You may have to demand that the user gives you a 'release' function that frees the memory in an item. You may need to have the user provide you with already fully allocated data. You need to think about who owns what the search function returns - is it still 'in' the hash table or has it been removed. Look hard at the C++ STL system - it generally speaking does an excellent job and modelling your requirements on what it requires may make sense. But remember, C++ has constructors and destructors to help it.

Jonathan Leffler
Damn, this sort of question/answering system does not work for this kind of question... I have so many things to discuss about your post that it makes it hard to do it in comments :S And I cannot do it right now either...
Nazgulled
@Nazgulled: I agree - it is an interesting question, but fiendishly hard to deal with in this format.
Jonathan Leffler
@Nazgulled: drop me an email if you want to join a chat about this - see my profile.
Jonathan Leffler
+2  A: 

I would malloc all of the data, and allow the client to the hash functions register a item_free() function at hash table init time. That way it's up to the "main programmer" how to handle it.

jdizzle
That maybe an easy way to handle it without worrying about it too much. I'll have to think about it...
Nazgulled