views: 1318

answers: 12
Hello,

I have a large number of sets of numbers. Each set contains 10 numbers and I need to remove all sets that have 5 or more (unordered) matching numbers with any other set.

For example:

set 1: {12,14,222,998,1,89,43,22,7654,23}
set 2: {44,23,64,76,987,3,2345,443,431,88}
set 3: {998,22,7654,345,112,32,89,9842,31,23}

Given the 3 sets of 10 numbers above, sets 1 and 3 would be considered duplicates because they have 5 matching numbers. So, in this case I would remove set 3 (because it's considered similar to set 1).

I have over 10000 sets to compare and I want to do this very efficiently. I've been turning this over and I just can't think of an efficient way to perform this comparison (it would be great to do this in a single pass).

any ideas? Thanks!

Mike

+2  A: 

You don't say much about the range of numbers that might appear, but I have two ideas:

  • an inverted list that maps each number that appears in the sets to the sets that contain it, then intersect those lists to find the sets that have numbers in common (see the sketch after this list).

  • divide the numbers into ranges of "close" numbers, then narrow down to the lists that have numbers appearing in those ranges. Keep reducing the ranges for the matching lists until you have a manageable number of lists and can compare them exactly. This would be a "proximity" approach, I think.
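A rough sketch of the inverted-list idea (my own illustration, not the answerer's code; the class and method names are made up). It assumes each set holds ten distinct numbers and counts, for every pair of sets that share at least one number, how many numbers they share:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedListSketch {
    // sets: the 10-number sets, indexed 0..M-1
    // returns a map from "i,j" (with i < j) to the number of shared values
    static Map<String, Integer> pairOverlaps(List<int[]> sets) {
        // inverted list: number -> indices of the sets that contain it
        Map<Integer, List<Integer>> inverted = new HashMap<>();
        for (int i = 0; i < sets.size(); i++) {
            for (int value : sets.get(i)) {
                inverted.computeIfAbsent(value, k -> new ArrayList<>()).add(i);
            }
        }
        // every pair of sets sharing a number gets its overlap count incremented
        Map<String, Integer> overlap = new HashMap<>();
        for (List<Integer> owners : inverted.values()) {
            for (int a = 0; a < owners.size(); a++) {
                for (int b = a + 1; b < owners.size(); b++) {
                    String key = owners.get(a) + "," + owners.get(b);
                    overlap.merge(key, 1, Integer::sum);
                }
            }
        }
        return overlap; // pairs with a count >= 5 would be the "duplicates"
    }
}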

A: 

Looks like you want to use the HashSet class. This gives you O(1) lookup time, which should make the comparisons very efficient if you get your loops right. (I'm not discussing the algorithm here, but rather simply suggesting a data structure in case it helps.)
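For example, if each set is stored as a HashSet<Integer>, the overlap test between two sets is straightforward (a minimal sketch of the idea, not code from the answer):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OverlapCheck {
    // true if the two sets share at least 'limit' numbers
    static boolean similar(Set<Integer> a, Set<Integer> b, int limit) {
        Set<Integer> common = new HashSet<Integer>(a);
        common.retainAll(b); // intersection; contains() on a HashSet is O(1)
        return common.size() >= limit;
    }

    public static void main(String[] args) {
        Set<Integer> s1 = new HashSet<Integer>(Arrays.asList(12, 14, 222, 998, 1, 89, 43, 22, 7654, 23));
        Set<Integer> s3 = new HashSet<Integer>(Arrays.asList(998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23));
        System.out.println(similar(s1, s3, 5)); // prints true: they share 5 numbers
    }
}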

Noldorin
Still, comparing each value of each set with each other set is far from linear in time. This is a classic example of trading space for performance; with more space, you can achieve linear time.
Lucero
@Lucero: I didn't suggest that! I simply suggested a data structure that is appropriate. Was it really worth the down vote? :P
Noldorin
It doesn't help with the question asked, does it?
Lucero
I don't see how O(1) lookup for specific sets is going to help with the comparisons either.
Michael Borgwardt
And btw, I didn't downvote... someone else did.
Lucero
@Michael: Of course it will help. Just because I'm only partially answering the question, it doesn't mean it's wrong. I recommend you go read about data structures if you don't see why. :)
Noldorin
Even though I'm not Michael, the thing is that you first have to know how to solve the problem at hand (what kind of algorithm to use) before choosing specific data structures.
Lucero
I was responding to Michael's comment about not being able to help. But anyway, I think you're taking an over-simplified view of things. Algorithms and data structures are tightly intertwined, and you always need to consider one to consider the other - it is not a one-way process!
Noldorin
+1  A: 

Since you need to compare all pairs of sets, the algorithm is about O(N^2), where N is the number of sets.

For each comparison, you can do about O(X+Y) work, where X and Y are the sizes of the two sets, in your case 10 each, so it is constant. But this requires you to sort all the sets beforehand, which adds O(N * x lg x), and again x lg x is constant in your case.

The linear comparison of two sets is fairly simple once the sets are sorted: you iterate over both sets at the same time. See C++ std::set_intersection for details.
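For illustration, here is a Java analog of that merge-style comparison over two sorted arrays (my sketch, not the answerer's code):

public class SortedIntersection {
    // Counts values common to two sorted int arrays in O(X + Y),
    // in the spirit of C++ std::set_intersection.
    static int countCommon(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                count++;
                i++;
                j++;
            }
        }
        return count;
    }
}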

The entire algorithm is then O(N^2), which would be pretty slow for 10000 sets.

leiz
Certainly you can do better than O(N^2). See my answer.
ilya n.
+4  A: 

Another perfect job for a Signature Tree. Once again I'm aghast that there isn't a library out there that implements them. Let me know if you write one.

From the abstract of the first paper in the search results above:

We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size and efficient for various queries.

Apocalisp
+17  A: 

You should rethink your requirements because as it is, the operation does not even have a well-defined result. For example, take these sets:

set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} 
set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15} 
set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

If you first consider 1 and 2 to be "duplicates" and eliminate set 1, then 2 and 3 are also "duplicates" and you are left with only one remaining set. But if you instead eliminate set 2 first, then 1 and 3 have no matches and you are left with two sets remaining.

You can easily expand this to your full 10,000 sets so that it would be possible that depending on which sets you compare and eliminate first, you could be left with only a single set, or with 5,000 sets. I don't think that is what you want.

Mathematically speaking, your problem is that you are trying to find equivalence classes, but the relation "similarity" you use to define them is not an equivalence relation. Specifically, it is not transitive. In layman's terms, if set A is "similar" to set B and set B is "similar" to set C, then your definition does not ensure that A is also "similar" to C, and therefore you cannot meaningfully eliminate similar sets.

You need to first clarify your requirements to deal with this problem before worrying about an efficient implementation. Either find a way to define a transitive similarity, or keep all sets and work only with comparisons (or with a list of similar sets for each single set).

Michael Borgwardt
Wow. Great mathematical view! :)
darkrain
It's the kind of thing you learn to do in your sleep when you study CS or math at university - and then hardly ever get to use at work.
Michael Borgwardt
Good explanation. Although I briefly mentioned exactly this issue two hours earlier and got downvoted.
Lucero
I think the OP just needs to define which set he will remove in case of an overlapping pair. One possibility is to go on a first-come-first-served basis. IMHO
kd304
As I described above, this can lead to radically different results depending on which sets are compared first and which are removed, with no useful way to decide on what would be a "correct" result. I don't think this would be useful.
Michael Borgwardt
@M.B.: Quote from my comment: "needs [...] to define". If you face an inconsistent case, define more constraints to resolve it - as has been done in science several times.
kd304
@M.B.: I totally agree. That's why the algorithm I posted gives the number of identical numbers for any combination/pair of sets; this should be valuable information for finding "duplicates" or overlapping sets. For instance, you could say that a set is considered a duplicate if (and only if) all involved sets share a certain overlap count. E.g. if A and B overlap by 5, no other pair with either A or B may overlap by 5 in order for A and B to be seen as duplicates. Or if you have A, B and C, all three pairs must be seen as duplicates in order to be counted as one.
Lucero
@Lucero: I didn't mean any offence, just wanted to point out a methodology to resolve the conflict.
kd304
Sure, you can define arbitrary constraints to get a repeatable result - the question is whether that result would be in any way useful to the OP, which I doubt. Lucero's suggestion moves the problem towards data mining and cluster analysis - which is a sensible approach, but also makes it more complex and slower, the opposite of what the OP actually asked for.
Michael Borgwardt
I guess the problem here is simply that the author has posted his interpretation of the problem to be solved, which is not consistent... (and @kd304, no offense taken, why should I?)
Lucero
@Lucero: I'm sorry, I didn't see the bigger picture at first. Thank you.
kd304
Quote: "and then hardly ever get to use at work", oh, very very true. Fortunately.
kd304
A: 

Let's assume you have a class NumberSet which implements your unordered set (and can enumerate ints to get the numbers). You then need the following data structures and algorithm:

  • Map<int, Set<NumberSet>> numberSets
  • Map<Pair<NumberSet, NumberSet>, int> matchCount
  • Pair<X,Y> is a key object which returns the same hash code and equality for any instance with the same X and Y (no matter whether they are swapped)

Now for each set to be added/compared do the following (pseudocode!!!):

for (int number : setToAdd) {
   Set<NumberSet> numbers = numberSets.get(number);
   if (numbers == null) {
      numbers = new HashSet<NumberSet>();
      numberSets.put(number, numbers);
   } else {
      for (NumberSet numberSet : numbers) {
         Pair<NumberSet, NumberSet> pairKey = new Pair<NumberSet, NumberSet>(numberSet, setToAdd);
         matchCount.put(pairKey, matchCount.get(pairKey)+1); // make sure to handle null as 0 here in real code ;)
      }
   }
   numbers.add(setToAdd); // register the new set in the inverted index for this number
}

At any time you can go through the pairs; each pair with a count of 5 or greater indicates a duplicate.
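A minimal sketch of such an unordered Pair key (my own illustration, not Lucero's code) might look like this:

// Key whose hashCode/equals ignore the order of x and y, so that
// new Pair(a, b) and new Pair(b, a) address the same matchCount entry.
public final class Pair<X, Y> {
    private final X x;
    private final Y y;

    public Pair(X x, Y y) {
        this.x = x;
        this.y = y;
    }

    public int hashCode() {
        return x.hashCode() + y.hashCode(); // symmetric by construction
    }

    public boolean equals(Object o) {
        if (!(o instanceof Pair)) return false;
        Pair<?, ?> other = (Pair<?, ?>) o;
        return (x.equals(other.x) && y.equals(other.y))
            || (x.equals(other.y) && y.equals(other.x));
    }
}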

Note: removing the sets may be a bad idea, because if A is considered a duplicate of B, and B a duplicate of C, C doesn't have to be a duplicate of A. So if you removed B, you'd not remove C, and the order in which you add your sets would become important.

Lucero
Now that's constructive: downvoting without any comment as to why.
Lucero
+1  A: 

You should compute the Pearson correlation coefficient between the two sets of data. This method will make your program easily scalable to huge data sets.

kunjaan
+2  A: 

I don't think there's a nice and beautiful way to do it. Most other answers will have you compare most pairs x, y, which would be O(N^2). You can do it faster.

Algorithm: keep an array of all 5-tuples. For each new set, split it into all possible 5-tuples and add them to that array. At the end, sort and check for duplicates.

There are C(10, 5) = 10*9*8*7*6/120 = 252 subsets of length 5 in a set of length 10. So you're keeping a table roughly 100 times larger than your data, but you perform only about 252*N insertions plus one sort. That should work in practice, and I suspect it's the best you can do theoretically as well.
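A sketch of that idea (my own illustration, not the answerer's code): enumerate the 5-subsets of each set in canonical (sorted) order, tag each with the owning set's index, then sort all tuples and scan for adjacent matches.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FiveTupleSort {

    // One entry per 5-subset: the five numbers in sorted order, plus the owning set's index.
    static List<int[]> allFiveTuples(List<int[]> sets) {
        List<int[]> tuples = new ArrayList<>();
        for (int index = 0; index < sets.size(); index++) {
            int[] sorted = sets.get(index).clone();
            Arrays.sort(sorted); // canonical order so equal subsets compare equal
            collect(sorted, 0, new int[6], 0, index, tuples);
        }
        return tuples;
    }

    private static void collect(int[] sorted, int start, int[] tuple, int depth, int owner, List<int[]> out) {
        if (depth == 5) {
            tuple[5] = owner; // remember which set this subset came from
            out.add(tuple.clone());
            return;
        }
        for (int i = start; i <= sorted.length - (5 - depth); i++) {
            tuple[depth] = sorted[i];
            collect(sorted, i + 1, tuple, depth + 1, owner, out);
        }
    }

    private static boolean sameFirstFive(int[] a, int[] b) {
        for (int i = 0; i < 5; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    // Sort all tuples, then scan: adjacent entries with the same five numbers but
    // different owners mean those two sets share at least 5 numbers.
    static void reportDuplicates(List<int[]> tuples) {
        tuples.sort((a, b) -> {
            for (int i = 0; i < 5; i++) {
                if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
            }
            return 0;
        });
        for (int i = 1; i < tuples.size(); i++) {
            int[] prev = tuples.get(i - 1), curr = tuples.get(i);
            if (sameFirstFive(prev, curr) && prev[5] != curr[5]) {
                System.out.println("sets " + prev[5] + " and " + curr[5] + " share a 5-subset");
            }
        }
    }
}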

ilya n.
+1  A: 

There is a way to do this with high time efficiency but extremely low space efficiency.

If my maths is correct, every combination of 5 numbers from a set of 10 gives 10!/(5!(10-5)!) = 252 combinations, multiplied by 10,000 sets = 2.52 million combinations. A set of 5 integers will consume 20 bytes, so you could put every combination for every set into a HashSet and only use about 50 megabytes (plus overhead, which will blow it out by 2-3 times at least).

Now that might seem expensive, but consider the alternative of checking each new set of 10 against the existing 10,000 individually: calculating its 252 sets of 5 and seeing if any of them are already in the hash set has to be better than that.

Basically:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SetOf5 {
  private final Set<Integer> numbers;
  private final int hashCode;

  public SetOf5(int... numbers) {
    if (numbers.length != 5) {
      throw new IllegalArgumentException();
    }
    Set<Integer> set = new HashSet<Integer>();
    int hash = 19;
    for (int i : numbers) {
      set.add(i);
      hash = 31 * i + hash; // order-independent: 19 + 31 * (sum of the numbers)
    }
    this.hashCode = hash;
    this.numbers = Collections.unmodifiableSet(set);
  }

  // other constructors for passing in, say, an array of 5 or a Collection of 5

  // this is precalculated because it will be called a lot
  public int hashCode() {
    return hashCode;
  }

  public boolean equals(Object ob) {
    if (!(ob instanceof SetOf5)) return false;
    SetOf5 setOf5 = (SetOf5) ob;
    return numbers.equals(setOf5.numbers);
  }
}

You then just have to do two things:

  1. Create a HashSet<SetOf5> for all your existing tuples of 5; and
  2. Create an algorithm to create all the possible sets of 5.

Your algorithm then becomes: for each set of 10 numbers, create all possible sets of 5 and check each one to see if it's already in the "set of sets". If one is, reject the set of 10; if none are, add its sets of 5 to the "set of sets" and continue.

I think you'll find that'll be an awful lot cheaper--at least in the case of 5 numbers from 10--than any brute force comparison of 10000 sets with one another.
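For instance, the driver loop might look roughly like this (a sketch assuming the SetOf5 class above plus a hypothetical generateSetsOf5 helper that enumerates the 252 subsets; it is not cletus's code):

// A sketch, not cletus's code: assumes the SetOf5 class above and a hypothetical
// generateSetsOf5(int[]) helper that returns the 252 five-element subsets of a set of 10.
static List<int[]> removeSimilarSets(List<int[]> allSets) {
    Set<SetOf5> seen = new HashSet<SetOf5>();
    List<int[]> kept = new ArrayList<int[]>();
    for (int[] setOf10 : allSets) {
        List<SetOf5> subsets = generateSetsOf5(setOf10);
        boolean duplicate = false;
        for (SetOf5 s : subsets) {
            if (seen.contains(s)) {
                duplicate = true;
                break;
            }
        }
        if (!duplicate) {
            seen.addAll(subsets); // only sets we keep register their subsets
            kept.add(setOf10);
        }
    }
    return kept;
}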

cletus
A: 

Maybe you need an algorithm like this (as I understand your problem)?

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/**
 * @author karnokd, 2009.06.28.
 * @version $Revision 1.0$
 */
public class NoOverlappingSets {
    // because of the shortcomings of java type inference, O(N)
    public static Set<Integer> setOf(Integer... values) {
        return new HashSet<Integer>(Arrays.asList(values));
    }
    // the test function, O(N)
    public static boolean isNumberOfDuplicatesAboveLimit(
            Set<Integer> first, Set<Integer> second, int limit) {
        int result = 0;
        for (Integer i : first) {
            if (second.contains(i)) {
                result++;
                if (result >= limit) {
                    return true;
                }
            }
        }
        return false;
    }
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        List<Set<Integer>> sets = new LinkedList<Set<Integer>>() {{
            add(setOf(12,14,222,998,1,89,43,22,7654,23));
            add(setOf(44,23,64,76,987,3,2345,443,431,88));
            add(setOf(998,22,7654,345,112,32,89,9842,31,23));
        }};
        List<Set<Integer>> resultset = new LinkedList<Set<Integer>>();
        loop:
        for (Set<Integer> curr : sets) {
            for (Set<Integer> existing : resultset) {
                if (isNumberOfDuplicatesAboveLimit(curr, existing, 5)) {
                    continue loop;
                }
            }
            // no overlapping with the previous instances
            resultset.add(curr);
        }
        System.out.println(resultset);
    }

}

I'm not an expert in Big O notation but I think this algorithm is O(N*M^2) where N is the number of elements in the set and M is the total number of sets (based on the number of loops I used in the algorithm). I took the liberty of defining what I consider overlapping sets.

I think your problem is polynomial. As I remember my lectures, the decision-based version would be NP-hard - but correct me if I'm wrong.

kd304
And in terms of memory efficiency, I think the algorithm consumes about M * k bytes of RAM, where k is somewhere between 8 and 20.
kd304
A remark: if you want to retain the original order of numbers in the sets, replace HashSet with LinkedHashSet.
kd304
A: 

Hi,

it's an easy problem because your sets are limited to a size of ten. For every set of ten numbers there are fewer than 1,000 subsets of the set which contain at least five numbers. Select a hash function that hashes integer sequences into, say, 32-bit numbers. For every set of ten integers, calculate the value of this hash function for every subset with five or more elements. This gives fewer than 1,000 hash values per set of ten numbers. Add a pointer to the set of ten integers to a hash table under all these keys. Once you have done this, your hash table has at most 1,000 * 10,000 = 10 million entries, which is completely doable; and this first pass is linear (O(n)) because the individual set size is bounded by 10.

In the next pass, iterate through all the hash values in whatever order. Whenever more than one set is associated with the same hash value, it most likely means they contain a common subset of at least five integers. Verify this, and then erase one of the sets and the corresponding hash table entries. Continue through the hash table. This is also an O(n) step.

Finally, suppose that you are doing this in C. Here is a routine that would calculate the hash values for a single set of ten integers. It is assumed that the integers are in ascending order:

static int hash_index;

/* forward declaration: hrec is defined below */
void hrec(int *myset, unsigned int *hash_values,
          unsigned int h, int idx, int card);

void calculate_hash(int *myset, unsigned int *hash_values)
{
  hash_index = 0;
  hrec(myset, hash_values, 0, 0, 0);
}

void hrec(int *myset, unsigned int *hash_values,
          unsigned int h, int idx, int card)
{
  if (idx == 10) {
    if (card >= 5) {
      hash_values[hash_index++] = h;
    }
    return;
  }
  unsigned int hp = h;
  hp += (myset[idx]) + 0xf0f0f0f0;
  hp += (hp << 13) | (hp >> 19);
  hp *= 0x7777;
  hp += (hp << 13) | (hp >> 19);
  hrec(myset, hash_values, hp, idx + 1, card + 1);
  hrec(myset, hash_values, h,  idx + 1, card);
}

This recurses through all the 1024 subsets and stores the hash values for subsets with cardinality 5 or more in the hash_values array. At the end, hash_index counts the number of valid entries. It is of course constant but I didn't calculate it numerically here.

antti.huima
A: 

We will take the data set, adorn each element with a signature, and sort it. The signature has the property that sorting groups together those elements which could have duplicates. When comparing data_set[i] to the items in data_set[i+1 ...], as soon as the signature duplicate check fails for an item, we advance i. This "adjacency criterion" assures us we don't have to look further; no element beyond that one can be a duplicate.

This reduces the O(N^2) comparison a great deal. How much I'll let an algorithm analyst decide, but the code below does ~400k comparisons instead of the 100m of a naive O(N^2).

The signature starts by bucketing the elements. We divide the range of the numbers into N equal-sized buckets: 1..k, k+1..2k, 2k+1..3k, ... When iterating over the elements, we increment the count of the bucket a number falls into. This yields an initial signature of the form (0,0,0,1,3,0,0,...,4,2).

The signature has the property that if

sum(min(sig_a[i], sig_b[i]) for i in range(10)) >= 5

then it is possible that the elements associated with the signatures have at least 5 duplicates. But more: if the above does not hold, then the elements cannot have 5 duplicates. Let's call this the "signature match criterion".

But sorting by the above signature does not have the adjacency property mentioned above. However, if we modify the signature to the two-element form:

(sum(sig[:-1]), sig[-1])

then the "signature match criterion" holds. But does the adjacency criterion hold? Yes. The sum of that signature is 10. If we enumerate, we have the following possible signatures:

(0,10) (1, 9) (2, 8) (3, 7) (4, 6) (5, 5) (6, 4) (7, 3) (8, 2) (9, 1) (10,0)

If we compare (0,10) against (1,9) .. (10,0), we note that once the signature test fails, it never again becomes true. The adjacency criterion holds. Furthermore, that adjacency criterion holds for all positive values, not just "5".

Ok, that's nice, but splitting the signature into two large buckets won't necessarily reduce the O(N^2) search; the signature is overly general. We solve that problem by creating a signature for sig[:-1], producing

(sum(sig[:-1]), sig[-1]), (sum(sig[:-2]), sig[-2]), ...

and so on. I believe this signature still satisfies adjacency, but I could be wrong.

There are some optimizations I didn't do: the signature only needs the last value of each tuple, not the first, but the sorting step would have to be revised. Also, the signature comparison could be optimized with an early fail when it becomes clear that further scanning cannot succeed.

# python 3.0
import random

# M number of elements, N size of each element
M = 10000
N = 10

# Bounds on the size of an element of each set
Vmin,Vmax = 0, (1 << 12)

# DupCount is number of identical numbers required for a duplicate
DupCount = 5

# R random number generator, same sequence each time through
R = random.Random()
R.seed(42)

# Create a data set of roughly the correct size
data_set = [list(s) for s in (set(R.randint(Vmin, Vmax) for n in range(N)) for m in range(M)) if len(s) == N]

# Adorn the data_set with signatures and sort
def signature(element, width, n):
    "Return a signature for the element"
    def pearl(l, s):
        def accrete(l, s, last, out):
            if last == 0:
                return out
            r = l[last]
            return accrete(l, s-r, last-1, out+[(s-r,r)])
        return accrete(l, s, len(l)-1, [])
    l = (n+1) * [0]
    for i in element:
        l[i // width] += 1
    return pearl(l, len(element))

# O(n lg(n)) - with only 10k elements, lg(n) is a little over 13
adorned_data_set = sorted([signature(element, (Vmax-Vmin+1)//12, 12), element] for element in data_set)

# Count the number of possible intersections
def compare_signatures(sig_a, sig_b, n=DupCount):
    "Return true if the signatures are compatible"
    for ((head_a, tail_a), (head_b, tail_b)) in zip(sig_a, sig_b):
        n -= min(tail_a, tail_b)
        if n <= 0:
            return True
    return False

k = n = 0
for i, (sig_a, element_a) in enumerate(adorned_data_set):
    if not element_a:
        continue
    for j in range(i+1, len(adorned_data_set)):
        sig_b, element_b = adorned_data_set[j]
        if not element_b:
            continue
        k += 1
        if compare_signatures(sig_a, sig_b):
            # here element_a and element_b would be compared for equality
            # and the duplicate removed by  adorned_data_set[j][1] = []
            n += 1
        else:
            break

print("maximum of %d out of %d comparisons required" % (n,k))
themis
Wrong, the adjacency criterion doesn't hold. If you drop the "break" after the compare_signatures, you'll see more comparisons take place. However, the partitions created for the signatures could be treated as points in a k-dimensional space. Nearest-neighbor queries _might_ work.
themis