views:

142

answers:

6

I am looking for an efficient way to solve the following problem.

List 1 is a list of records that are identified by a primitive triplet:

X | Y | Z

List 2 is a list of records that are identified by three sets. One Xs, one Ys, one Zs. The X, Y, Zs are of the same 'type' as those in list one so are directly comparable with one another.

Set(X) | Set(Y) | Set(Z)

For an item in list 1 I need to find all the items in list 2 where the X, Y, Z from list 1 all occur in their corresponding sets in list 2. This is best demonstrated by an example:

List 1:

X1, Y1, Z1

List 2:

(X1, X2) | (Y1) | (Z1, Z3)

(X1) | (Y1, Y2) | (Z1, Z2, Z3)

(X3) | (Y1, Y3) | (Z2, Z3)

In the above, the item in list 1 would match the first two items in list 2. The third item would not be matched as X1 does not occur in the X set, and Z1 does not occur in the Z set.

I have written a functionally correct version of the algorithm but am concerned about performance on larger data sets. Both lists are very large so iterating over list 1 and then performing an iteration over list 2 per item is going to be very inefficient.

I tried to build an index by de-normalizing each item in list 2 into a map, but the number of index entries in the index per item is proportional to the size of the item's subsets. As such this uses a very high level of memory and also requires some significant resource to build.

Can anyone suggest to me an optimal way of solving this. I'm happy to consider both memory and CPU optimal solutions but striking a balance would be nice!

+3  A: 

There are going to be a lot of ways to approach this. Which is right depends on the data and how much memory is available.

One simple technique is to build a table from list2, to accelerate the queries coming from list1.

from collections import defaultdict

# Build "hits".  hits[0] is a table of, for each x,
# which items in list2 contain it. Likewise hits[1]
# is for y and hits[2] is for z.
hits = [defaultdict(set) for i in range(3)]
for rowid, row in enumerate(list2):
    for i in range(3):
        for v in row[i]:
            hits[i][v].add(rowid)

# For each row, query the database to find which
# items in list2 contain all three values.
for x, y, z in list1:
    print hits[0][x].intersection(hits[1][y], hits[2][z])
Jason Orendorff
Thank you Jason, this has made me think about the problem entirely differently. I was stuck in the 'one index' line of thought, but creating three is much more flexible.
Scruffers
A: 

How about using HashSet (or HashSets) for List 2 ? This way you will only need to iterate over List 1

David Soroko
That would work if the items in list 1 and list 2 were directly comparable. However we need to test each item in list 2 for every occurrence in list 1 as the items are not directly comparable and list 2 is the driver.
Scruffers
Not comparable in the sense that List2 contains sets of primitives and List1 contains primitives? In that case convert List2 to a single HashSet (e.g. {X1, X2, Y1, Z1, Z3} ) and proceed by iterating over List1 and for each member do a quick contains() on the hashset. Creating the hashset is not free, but you can reason about the tradeoffs.
David Soroko
David, I see your point and this could work. I should have clarified that X, Y, Zs are actually just sets of integers so it would be possible to have same value in X and Y set but this distinction would be lost in a single hash set. My apologies.
Scruffers
+1  A: 

If the total size of the Sets is not too large you could try to model List 2 as bitfields. The structure will be probably quite fragmented though - maybe the structures referenced in the Wikipedia article on Bit arrays (Judy arrays, tries, Bloom filter) can help address the memory problems of you normalization approach.

yawn
+1  A: 

You could build a tree out of List2; the first level of the tree is the first of (X1..Xn) that appears in set X. The second level is the values for the second item, plus a leaf node containing the set of lists which contain only X1. The next level contains the next possible value, and so on.

Root --+--X1--+--EOF--> List of pointers to list2 lines containing only "X1"
       |      |
       |      +--X2---+--EOF--> List of pointers to list2 lines containing only "X1,X2"
       |      |       |
       |      |       +--X3--+--etc--
       |      |       
       |      +--X3---+--EOF--> "X1,X3"
       |             
       +--X2--+--EOF--> "X2"
       |      |
       |      +--X3---+--EOF--> "X2,X3"
       |      |       |
       ...

This is expensive in memory consumption (N^2 log K, I think? where N=values for X, K=lines in List2) but results in fast retrieval times. If the number of possible Xs is large then this approach will break down...

Obviously you could build this index for all 3 parts of the tuple, and then AND together the results from searching each tree.

Alex Feinman
A: 

If you use Guava, there is a high-level way to do this that is not necessarily optimal but doesn't do anything crazy:

List<SomeType> list1 = ...;
List<Set<SomeType>> candidateFromList2 = ...;
if (Sets.cartesianProduct(candidateFromList2).contains(list1)) { ... }

But it's not that hard to check this "longhand" either.

Kevin Bourrillion
+1  A: 

There's a fairly efficient way to do this with a single pass over list2. You start by building an index of the items in list1.

from collections import defaultdict

# index is HashMap<X, HashMap<Y, HashMap<Z, Integer>>>
index = defaultdict(lambda: defaultdict(dict))
for rowid, (x, y, z) in enumerate(list1):
    index[x][y][z] = rowid

for rowid2, (xs, ys, zs) in enumerate(list2):
    xhits = defaultdict(list)
    for x in xs:
        if x in index:
            for y, zmap in index[x].iteritems():
                xhits[y].append(zmap)

    yhits = defaultdict(list)
    for y in ys:
        if y in xhits:
            for z, rowid1 in xhits[y].iteritems():
                yhits[z].append(rowid1)

    for z in zs:
        if z in yhits:
            for rowid1 in yhits[z]:
                print "list1[%d] matches list2[%d]" % (hit[z], rowid2)

The extra bookkeeping here will probably make it slower than indexing list2. But since in your case list1 is typically much smaller than list2, this will use much less memory. If you're reading list2 from disk, with this algorithm you never need to keep any part of it in memory.

Memory access can be a big deal, so I can't say for sure which will be faster in practice. Have to measure. The worst-case time complexity in both cases, barring hash table malfunctions, is O(len(list1)*len(list2)).

Jason Orendorff