For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulate the following:

DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers

This is probably not relevant, but the list values contains distinct integers with the property that, for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.

The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.

Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.

We are using the following algorithm. Convert the list values to a string via

string signature = String.Join("|", values.OrderBy(n => n).ToArray());

then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches and then check to see if the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
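For illustration, here is a rough dictionary-based sketch of that matching step (Instance, day1 and day2 are hypothetical stand-ins for our actual types; our real code sorts the lists of hash codes and walks them instead, but the effect is the same):

using System;
using System.Collections.Generic;
using System.Linq;

class Instance {
    public DateTime AsOfDate;
    public int Key;
    public List<int> Values;
}

// Find day-2 instances whose values match a day-1 instance under a different key.
void FindKeyChanges(List<Instance> day1, List<Instance> day2) {
    // Signatures are unique within a day because the values lists are disjoint.
    Dictionary<string, Instance> day1BySignature = day1.ToDictionary(
        inst => String.Join("|", inst.Values.OrderBy(n => n).ToArray()));

    foreach (Instance inst in day2) {
        string signature = String.Join("|", inst.Values.OrderBy(n => n).ToArray());
        Instance match;
        if (day1BySignature.TryGetValue(signature, out match) && match.Key != inst.Key) {
            // Same set of values on adjacent days, but under a different key.
        }
    }
}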

Is there a better method?

+5  A: 

You could probably just hash the list itself, instead of going through String.

Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)

You can do this in O(n + m) if you use a bucket array (essentially: a hashtable) that you plug the hashes in. You can compare the two bucket arrays in O(max(n, m)) assuming a length dependent on the number of entries (to get a reasonable load factor).

It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
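For instance, reusing the string signatures from the question, a rough sketch might look like this (day1 and day2 are assumed collections of your instances; this intersects the signature sets, and you would still look up the matching entries afterwards to compare their keys):

HashSet<string> day1Signatures = new HashSet<string>(
    day1.Select(inst => String.Join("|", inst.Values.OrderBy(n => n).ToArray())));
HashSet<string> day2Signatures = new HashSet<string>(
    day2.Select(inst => String.Join("|", inst.Values.OrderBy(n => n).ToArray())));

// Afterwards day1Signatures holds only the value-sets present on both days.
day1Signatures.IntersectWith(day2Signatures);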

You cannot do better than O(n + m), because every entry needs to be visited at least once.

Edit: misread, fixed.

Thomas
I believe that the hash algorithm for a list does not order the elements before computing the hash, so that it can distinguish {1, 2} from {2, 1}. Thus, at a minimum, ordering is necessary. But then you're right: we can hash the ordered list instead of first going through String.
Jason
Ah, good point. I guess, if order doesn't matter, that you might use a HashSet<int> instead of a List<int> as well. The HashSet will probably hash to a decent hash value irrespective of ordering :)
Thomas
HashSet doesn't implement GetHashCode, so the hash code isn't based on the data in the HashSet.
Guffa
A: 

Does the ordering matter? i.e. [1,2] on day 1 and [2,1] on day 2, are they equal? If they are, then hashing might not work all that well. You could use a sorted array/vector instead to help with the comparison.

Also, what kind of values are they? Do they have a definite range (e.g. 0-63)? You might be able to concatenate them into a large integer (which may require more than 64 bits of precision) and hash that, instead of converting to a string, because that might take a while.
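For example, if the values were known to fit in 0-63, each list could be folded into a single 64-bit mask (just an illustration of the idea; a wider range would need a BigInteger or similar):

// Distinct values in 0..63 map to one bit each, so the mask is an exact,
// order-independent fingerprint of the list -- no sorting, no collisions.
ulong PackValues(List<int> values) {
    ulong mask = 0;
    foreach (int v in values) mask |= 1UL << v;
    return mask;
}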

Calyth
As per the comment on my post, I deduce that ordering does not matter, so [1,2] is the same as [2,1].
Thomas
+3  A: 

On top of the other answers, you could make the process faster by creating a low-cost hash constructed simply by XORing all the elements of each List together. You wouldn't have to order your list, and all you'd get is an int, which is easier and faster to store than strings.

Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it. If there is already an existing key, only then do you sort the corresponding Lists and compare them.

You still need to compare the lists when you find a match, because a simple XOR can produce collisions.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.

If you were to have your own implementation of List<>, you could build the generation of the XOR key into it so that it is updated on each operation on the List, as sketched below.
This would make the process of checking for duplicate lists even faster.
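A minimal sketch of such a self-updating wrapper (a made-up class, not part of the framework; since XOR is its own inverse, removals are as cheap as additions):

class XorTrackedList {
    private List<int> items = new List<int>();
    public int XorKey { get; private set; }

    public void Add(int value) {
        items.Add(value);
        XorKey ^= value;  // fold the new element into the running key
    }

    public bool Remove(int value) {
        bool removed = items.Remove(value);
        if (removed) XorKey ^= value;  // XOR is self-inverse, so this undoes Add
        return removed;
    }
}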

Code

Below is a first attempt at implementing this.

Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList) {
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    checkHash.TryGetValue(xorkey, out existingLists);
    if (existingLists != null) {
        // XOR key already seen: check each stored list for a true match.
        foreach (List<int> li in existingLists) {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical) {
                // Same length; since the integers are distinct, per-element
                // containment is enough to establish set equality.
                foreach (int v in theList) {
                    if (!li.Contains(v)) {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    }
    if (!isIdentical) {
        // Never seen this list before: remember it under its XOR key.
        // (On a collision, append to the existing bucket rather than
        // calling Add again, which would throw for a duplicate key.)
        if (existingLists == null) {
            existingLists = new List<List<int>>();
            checkHash.Add(xorkey, existingLists);
        }
        existingLists.Add(theList);
    }
    return isIdentical;
}

It's not the most elegant or easiest to read at first sight; it's rather 'hacky', and I'm not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing Lists of List<int> in the Dictionary.

If a duplicate key is found, we loop through each previously stored List until we find a match or run out of candidates.

The good point about the code is that it should probably be as fast as you can get in most cases, and still faster than building strings when there is a collision.

Renaud Bompuis
Elegant. I like.
Jason
As long as nothing is known about the distribution of the incoming numbers, the XOR method might not be a very good hash function...
Thomas
No it's not; that's why you still need to check whether you have a collision. The main advantage is that it's fast, and you only need to do the expensive test when there is a collision.
Renaud Bompuis
What you have done is to duplicate what the dictionary already does... Only slightly less efficiently... ;)
Guffa
Agreed, yours is a better use of the framework and more elegant than my brute-force approach.
Renaud Bompuis
+2  A: 

Implement an IEqualityComparer for List, then you can use the list as a key in a dictionary.

If the lists are sorted, it could be as simple as this:

class IntListEqualityComparer : IEqualityComparer<List<int>> {

   public int GetHashCode(List<int> list) {
      int code = 0;
      foreach (int value in list) code ^= value;
      return code;
   }

   public bool Equals(List<int> list1, List<int> list2) {
      if (list1.Count != list2.Count) return false;
      for (int i = 0; i < list1.Count; i++) {
        if (list1[i] != list2[i]) return false;
      }
      return true;
   }

}

Now you can create a dictionary that uses the IEqualityComparer:

Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());

Add all the items from the first day into the dictionary, then loop through the items from the second day and check if the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
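In outline, the matching might then look like this (day1Items and day2Items are hypothetical collections of YourClass, assuming it exposes its Key and its sorted Values list):

foreach (YourClass item in day1Items) day1.Add(item.Values, item);

foreach (YourClass item in day2Items) {
    YourClass match;
    if (day1.TryGetValue(item.Values, out match) && match.Key != item.Key) {
        // Same values list on both days, but stored under a different key.
    }
}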

You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much whatever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (with the same hash code).

Guffa
A: 

It might be worthwhile to put this in a SQL database. If you don't want a full-blown DBMS, you could use SQLite.

This would turn uniqueness checks, unions, and similar operations into very simple, efficient queries. It would also let you easily store the information if it is ever needed again.

Ben S
A: 

Would you consider summing up the list of values to obtain an integer that can be used as a precheck of whether different lists contain the same set of values?

There will be more collisions (the same sum doesn't necessarily mean the same set of values), but I think it can still eliminate a large share of the comparisons up front.
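If overflow is a worry, the sum can be computed in an explicitly unchecked block so it simply wraps around and remains a deterministic fingerprint (a sketch; note that LINQ's Sum() uses checked arithmetic internally and would throw on overflow instead):

int sumKey = 0;
unchecked {
    // Wraps on overflow instead of throwing, even if the project
    // is compiled with overflow checking enabled.
    foreach (int v in theList) sumKey += v;
}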

Conrad
XOR and other bit-wise operations are usually better choices because you can't get an overflow whereas an addition might result in one.
Renaud Bompuis
I originally thought that it wouldn't be a big deal whether there was an overflow or not, as the actual value of the sum is not meaningful. Sorry if this approach is not a sound one.
Conrad