I have a structure with 3 identifier fields and one value field, and I have a list of these objects. To give an analogy, the identifier fields are like a primary key: together, these 3 fields uniquely identify an object.

struct Entry
{
   int a1;    // identifier fields
   int a2;
   int a3;
   int value; // associated value
};

I will have a list of, say, 1000 objects of this type. I need to look up specific combinations of these identity fields by passing values of a1, a2 and a3 to a lookup function, which checks whether any object with those specific values of a1, a2 and a3 is present and returns its value. What is the most effective way to implement this to achieve the best lookup time?

One solution I could think of is to have a 3-dimensional matrix of length, say, 1000 and populate the values in it. This has a lookup time of O(1). But the disadvantages are: 1. I need to know the length of the array in advance. 2. For a higher number of identity fields (say 20), I would need a 20-dimensional matrix, which would be overkill on memory. For my actual implementation, I have 23 identity fields.

Can you suggest a good way to store this data that would give me the best lookup time?

+4  A: 

Create a key class that contains all the identity fields, define an appropriate equality function and hash function for it, and then use a hash map from the key class to its associated value. This gives you a time complexity of O(1) per lookup in the expected case, and it only requires space proportional to the number of key combinations actually observed (typically about twice that number, although you can adjust the constant for the time/space tradeoff you desire), rather than space proportional to all possible key combinations.
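
For instance, a minimal sketch of that approach in C++ (the Key type and the use of Boost here are illustrative, not the only way to do it):

#include <cstddef>
#include <boost/unordered_map.hpp>
#include <boost/functional/hash.hpp>

// Illustrative key type holding the identity fields.
struct Key {
    int a1, a2, a3;
};

bool operator==(const Key &lhs, const Key &rhs) {
    return lhs.a1 == rhs.a1 && lhs.a2 == rhs.a2 && lhs.a3 == rhs.a3;
}

// Found via ADL by boost::hash<Key>, which boost::unordered_map uses by default.
std::size_t hash_value(const Key &k) {
    std::size_t seed = 0;
    boost::hash_combine(seed, k.a1);
    boost::hash_combine(seed, k.a2);
    boost::hash_combine(seed, k.a3);
    return seed;
}

boost::unordered_map<Key, int> table;  // maps (a1, a2, a3) -> value

A lookup is then just table.find(k) for a Key k built from the three fields, which is O(1) in the expected case.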

Michael Aaron Safyan
Keep in mind that it's only O(1) if you have a perfect hash (one item per bucket). This usually requires foreknowledge of the key properties.
paxdiablo
@paxdiablo, it's still O(1), even without a perfect hash... collisions do occur, believe it or not, even with ordinary hash table implementations... and if you use something like boost::hash_combine to construct the hash, then it shouldn't be a problem coming up with a fairly spread out hash function.
Michael Aaron Safyan
It is _not_ O(1). Perfect hashes, by definition, map each key to a unique bucket. That would be O(1), since the number of operations is identical regardless of the data set size. In *any* implementation where the hash isn't perfect, it's O(n) for simple lists in each bucket, although it can be optimised to less if each bucket is yet another hash. But it will _never_ reach the mythical O(1).
paxdiablo
We're going to have to agree to disagree there, I'm afraid. That's no different from storing everything in an unsorted list and saying you're bounded by 1000 array-element comparisons, therefore constant time. The fact remains that the complexity depends not on 1 (constant time) but on the number of elements. As n approaches infinity (or some arbitrarily large number), the divisor is irrelevant for complexity analysis; it's still dependent on n.
paxdiablo
@paxdiablo, I believe you are confusing the worst-case performance and the average-case performance. While it is true that the worst-case performance is O(n) -- assuming all hashes collide to the same index -- it is O(1) in the expected case.
Michael Aaron Safyan
For this algorithm, there are 23 keys (a1, a2, ..., a23) and the length of the list would be upward of 50K. It is an optimization problem using dynamic programming, and I am implementing a depth-first search. A hash table seems to be the popular solution to this. Assuming I come up with a good hash function, there will still be collisions leading to more than one item per bucket; in that case, should I sort the contents of each bucket? Can I use a simple hash function like sumOf(keys)? Any suggestions about the hash function for this scenario?
Amrish
@Amrish, what programming language are you using? Most programming languages already provide a hash map / hashtable implementation, and most also provide some mechanism of creating good hashes.
Michael Aaron Safyan
I am using C++ for my implementation
Amrish
@Amrish, in that case, you might want to consider using boost::unordered_map as the hash map implementation and use boost::hash_combine to construct your hashes.
Michael Aaron Safyan
A: 

Use a hash table (map). Construct the key as the string "a1-a2-a3", and store the data as H(key) = data.
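
A rough sketch of that idea (the make_string_key helper is just illustrative):

#include <sstream>
#include <string>
#include <boost/unordered_map.hpp>

// Build a "a1-a2-a3" style string key, e.g. make_string_key(1, 2, 3) == "1-2-3".
std::string make_string_key(int a1, int a2, int a3) {
    std::ostringstream oss;
    oss << a1 << '-' << a2 << '-' << a3;
    return oss.str();
}

boost::unordered_map<std::string, int> table;  // H(key) = data
// table[make_string_key(1, 2, 3)] = 42;       // store
// int v = table[make_string_key(1, 2, 3)];    // look up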

sza
A: 

I would simply sort the array by key, and use a binary search.

(untested)

int compare_entry(ENTRY *k1, ENTRY *k2) {
    int d = k1->a1 - k2->a1;
    if (d == 0) {
        d = k1->a2 - k2->a2;
        if (d == 0) {
            d = k1->a3 - k2->a3;
        }
    }
    return d; // >0 if k1 > k2, 0 if k1 == k2, <0 if k1 < k2
}

// Derived from Wikipedia (leftmost binary search)
int find(ENTRY *list, int size, ENTRY *value) {
    int low = 0;
    int high = size - 1;
    while (low < high) {
        int mid = low + (high - low) / 2;  // avoids overflow of (low + high)
        int cmp = compare_entry(&list[mid], value);
        if (cmp < 0) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    // 'low' is now the leftmost candidate position; check it (this also covers the last element).
    if (size > 0 && compare_entry(&list[low], value) == 0) {
        return low;  // found item at index 'low'
    }
    return -1;       // not found
}
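
To use this, the list must first be sorted with the same ordering; a rough sketch of that step (the compare_entry_void wrapper name is just illustrative):

#include <stdlib.h>

// qsort-compatible wrapper around compare_entry.
int compare_entry_void(const void *p1, const void *p2) {
    return compare_entry((ENTRY *)p1, (ENTRY *)p2);
}

// Sort once up front, then call find() as often as needed:
//   qsort(list, size, sizeof(ENTRY), compare_entry_void);
//   ENTRY key = { a1, a2, a3, 0 };
//   int idx = find(list, size, &key);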

In the absolute worst case, you run through this loop, what, 10 times (for 1000 entries), and end up actually doing all of the comparisons in the key comparison each time. So that's, what, 85 integer math operations (additions, subtractions, and 1 shift per iteration)?

If your a1-a3 each range from 0-99, then you can make your key a1 * 10000 + a2 * 100 + a3, do a single compare, and the worst case is 63 integer math operations. And your entire array fits within cache on most any modern processor. And it's memory efficient.
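
A sketch of that packing (assuming each field really stays within 0-99):

// Pack three small key fields (each assumed to be in 0..99) into one int,
// so entries can be sorted and binary-searched with a single integer compare.
int make_packed_key(int a1, int a2, int a3) {
    return a1 * 10000 + a2 * 100 + a3;
}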

You can burn memory with a perfect hash or some other sparse matrix. Even with a perfect hash, I bet the hash calculation itself is competitive with this time, considering multiplication is expensive. This hits the memory bus harder, obviously.

Will Hartung