views:

421

answers:

7

The domain of interest is string matching. Assume I have a structure like this.

typedef struct
{
    char *name,
    int (*function)();

} StringArray

StringArray s[] = 
{
    {"George", func1},
    {"Paul",   func2},
    {"Ringo",  func3},
    {"John",   func4}
}

There are a fixed number of strings in the array. They are hard coded as in the example. If the table changes, there would be a need to reevaluate the quality of the hash function.

I want to apply a hash function to a string, and if the string matches one in the array, then call the fuction. A perfect hash function is needed for this. No collisions are allowed. The purpose of requiring hashing is to get O(1) performance on the lookup.

What ideas do you have on designing a function to do this?

A: 

You can use map

std::string foo() { return "Foo"; }
std::string bar() { return "Bar"; }

int main()
{
   std::map<std::string, std::string (*)()> m;
   m["foo"] = &foo;
   m["bar"] = &bar; 
}
Vinay
std::map doesn't use a hash - it is tree based
anon
who voted this up?!?
Mitch Wheat
why to invent the wheel, you can use existing libraries like map.
Vinay
perhaps the questioner wanted the performance characteristics of hashing rather than those of tree searching?
anon
@Mitch: what's the problem? We should often ask ourselves not how to do this or that but why I want to use this or that.
Mykola Golubyev
Precisely. Why does EvilTeach want a hash function? There may be good reasons for exactly that, but I haven't seen them. For the general case of taking a string and executing a function, this works fine.
David Thornley
For the reason stated by Niel. Performance.
EvilTeach
+15  A: 

See the gperf home page.

anon
The gulling part is that there is a link to this at the bottom of the wikipedia page.
EvilTeach
A: 

If collisions are absolutely not allowed, your only option is to keep track of every string in the database, which is probably not a best way to go.

What I would do is apply one of existing common strong hashing algorithms, such as: MD5 or SHA. There's miriads of samples all around, here's one for example: http://www.codeproject.com/KB/security/cryptest.aspx

galets
A: 

Well, there is no perfect hash function.

You have several that minimize collisions, but none eliminates them.

Can't advise one though :P

EDIT: The solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.

Megacan
http://en.wikipedia.org/wiki/Perfect_hashing
Adam Rosenfield
@Adam: There's a pretty big caveat there given that it only applies when there is a distinct data set. Since the OP didn't make any mention of limiting the strings being used I'd agree with Megacan that there is no perfect hash in this case. +1.
sipwiz
The questioner does make that mention, at least implicitly - there were only four Beatles )or siz if you include the drummer they sacked and Stu whatsisname) - still, a fixed data set.
anon
I was only saying, the solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.
Megacan
Yes it will be a fixed data set, there are only so many strings he could type in a lifetime :). The point here though is that "there is no perfect hash function" is a truism unless there are explicit constraints specified. Megacan was logoically correct in my book.
sipwiz
@megacan - follow the link in the answer I posted - perfect hash functions are possible and quite widely used
anon
@spi You are correct. I should make it clear that that the table is static.
EvilTeach
@Neil - I've read the link. It says that a perfect hash function maps a key to a single hashcode, and no 2 keys have the same hash? am I right? I'm not arguing with that. That can only be possible if the number of combinations of keys are less or equal than the number of hashcodes.
Megacan
A: 

Use a balanced binary tree. Then you KNOW behaviour is ALWAYS O(logn).

I strongly dislike hashes. People don't realise how much risk they take with their algorithm. They run some test data and then deploy in the field. I have NEVER seen a deployed hash algorithm get checked for behaviour in the field.

O(log n) is almost always acceptable in lieu of O(1).

Blank Xavier
“O(log n) is almost always acceptable in lieu of O(1).” In many applications, this statement is flat out wrong. Just increase the number of data points to above a few millions to see this.
Konrad Rudolph
Once you've done that, test. Hashes do not give guaranteed results, unless you know in advance what all possible inputs can be. A hash function that tends to clump the input will likely not give you O(1).
David Thornley
In this case, all the inputs are known. They are sitting in the array.and input string is either an exact match, or a no-match.
EvilTeach
No, all the dispatchable inputs are known. Once you've hashed the input string, and gotten a hit, you do need to compare.
David Thornley
Konrad - yes. But the number of applications with so many datum are few and far between. They form a specific, rather rare class. This is why I said *almost always acceptable*.
Blank Xavier
@david, ya, absolutely correct.
EvilTeach
+1  A: 

The summary lists both C and C++. Which of them are you looking for? C and C++ are two distinct languages, and differ greatly in their string handling and data structures (and the fact that the C ones work in C++ doesn't change that).

Why, specifically, do you want a perfect hash function? Is it that you want to associate a string with a function, and thought that would be a good way to do it? Is this some sort of homework assignment? Do you have a reason not to use map<> in C++? (Or unordered_map<> if available?)

If you do need a perfect hash, what are the constraints on the strings? Will there be a certain fixed set you want to dispatch on? What about strings that don't match one of the set? Are you willing to accept hits from random strings, or is the number of incoming strings limited?

If you could edit your question to include information like that, we could be a lot more helpful.

EDIT (in response to the first two comments):

OK, we should look at C solutions, since you presumably want this for both your C and C++ work. You presumably want the performance, but have you tested? If we're dealing with strings coming in on the I/O system, the time there is likely to dwarf the dispatch time.

You are expecting arbitrary strings. It's a bit much to expect a perfect hash function that will avoid all collisions from random data, so you do need to consider that.

Have you considered a trie? It may be more efficient than a perfect hash function (or may not be), it should be fairly easy to implement in C, and it will avoid problems with redoing your list of dispatching strings or possible collisions.

David Thornley
I code in both c and c++, and god help me Pro*C.O(1) hashing for performance.Lol, no homework assignment. I am looking to build a tool to speed up some performance critical code. The example is made simple for discussion purposes. The real world use is not.
EvilTeach
The strings will very in length. None of them will be of length zero.As a practical limit, no string in the array will be longer than 32 characters. What the caller passes in can be of any length, but if it is longer than the strings in the table it is the case of a no-match
EvilTeach