We will take the data set, adorn each element with a signature, and
sort it. The signature has the property that sorting will group
those elements together which could have duplicates. When
comparing data_set[j] to items in data_set[j+1 ...], when the
first signature in [j+1 ...] duplicate check fails we, we advance
i. This "adjacency criterion" assures we don't have to look further;
no element beyond this can be a duplicate.
This reduces the O(N^2) comparison a great deal. How much I'll let
an algorithm analyst decide, but the code below does ~400k comparisons
instead of the 100m of a naive O(N^2).
The signature starts by bucketing the elements. We divide the range
of the numbers into N equal sized buckets: 1..k, k+1..2k, 2k+1..3k, ...
When iterating over the elements, we increment the count if the number
falls into a particuar bucket. This yields an initial signature of
the form (0,0,0,1,3,0,0,...4,2).
The signature has the property that if
sum(min(sig_a[i], sig_b[i]) for i in range(10)) >= 5
then it is possible the elements associated with the signatures
have at least 5duplicates. But more, if the above does not hold,
then the elements cannot have 5 duplicates. Lets call this the
"signature match criterion".
But, sorting by the above signature does not have the adjacency property
mentioned above. However, if we modify the signature to be of
the two element form:
(sum(sig[:-1]), sig[-1])
then the "signature match criterion" holds. But does the adjacency
criterion hold? Yes. The sum of that signature is 10. If we
enumerate, we have the following possible signatures:
(0,10) (1, 9) (2, 8) (3, 7) (4, 6) (5, 5) (6, 4) (7, 3) (8, 2) (9, 1) (10,0)
If we compare (0,10) against (1,9) .. (10,0), we note that the
once the signature test fails it never again becomes true. The
adjacency criterion holds. Furthermore, that adjacency criterion
holds for all positive values, not just "5".
Ok, that's nice, but splitting the signature into two large buckets
won't necessarily reduce the O(N^2) search; the signature is overly
general. We solve that problem by creating a signature for
sig[:-1], producing
(sum(sig[:-1]), sig[-1]), (sum(sig[:-2]), sig[-2]), ...
and so on. I believe this signature still satisfies adjacency, but
I could be wrong.
There are some optimizations I didn't do: the signature
only needs the last value of each tuple, not the first, but
the sorting step would have to be revised. Also, the signature
comparison could be optimized with an early fail when it becomes
clear that further scanning is cannot succeed.
# python 3.0
import random
# M number of elements, N size of each element
M = 10000
N = 10
# Bounds on the size of an element of each set
Vmin,Vmax = 0, (1 << 12)
# DupCount is number of identical numbers required for a duplicate
DupCount = 5
# R random number generator, same sequence each time through
R = random.Random()
R.seed(42)
# Create a data set of roughly the correct size
data_set = [list(s) for s in (set(R.randint(Vmin, Vmax) for n in range(N)) for m in range(M)) if len(s) == N]
# Adorn the data_set signatures and sort
def signature(element, width, n):
"Return a signature for the element"
def pearl(l, s):
def accrete(l, s, last, out):
if last == 0:
return out
r = l[last]
return accrete(l, s-r, last-1, out+[(s-r,r)])
return accrete(l, s, len(l)-1, [])
l = (n+1) * [0]
for i in element:
l[i // width] += 1
return pearl(l, len(element))
# O(n lg(n)) - with only 10k elements, lg(n) is a little over 13
adorned_data_set = sorted([signature(element, (Vmax-Vmin+1)//12, 12), element] for element in data_set)
# Count the number of possible intersections
def compare_signatures(sig_a, sig_b, n=DupCount):
"Return true if the signatures are compatible"
for ((head_a, tail_a), (head_b, tail_b)) in zip(sig_a, sig_b):
n -= min(tail_a, tail_b)
if n <= 0:
return True
return False
k = n = 0
for i, (sig_a, element_a) in enumerate(adorned_data_set):
if not element_a:
continue
for j in range(i+1, len(adorned_data_set)):
sig_b, element_b = adorned_data_set[j]
if not element_b:
continue
k += 1
if compare_signatures(sig_a, sig_b):
# here element_a and element_b would be compared for equality
# and the duplicate removed by adorned_data_set[j][1] = []
n += 1
else:
break
print("maximum of %d out of %d comparisons required" % (n,k))