ansaurus

Question

Efficient Matching Algorithm for Set Based Triplets

Answer 1

+3 A:

There are going to be a lot of ways to approach this. Which is right depends on the data and how much memory is available.

One simple technique is to build a table from list2, to accelerate the queries coming from list1.

from collections import defaultdict

# Build "hits".  hits[0] is a table of, for each x,
# which items in list2 contain it. Likewise hits[1]
# is for y and hits[2] is for z.
hits = [defaultdict(set) for i in range(3)]
for rowid, row in enumerate(list2):
    for i in range(3):
        for v in row[i]:
            hits[i][v].add(rowid)

# For each row, query the database to find which
# items in list2 contain all three values.
for x, y, z in list1:
    print hits[0][x].intersection(hits[1][y], hits[2][z])

Jason Orendorff 2010-01-04 21:08:53

Thank you Jason, this has made me think about the problem entirely differently. I was stuck in the 'one index' line of thought, but creating three is much more flexible.

Scruffers 2010-01-04 21:34:06

Answer 2

A:

How about using HashSet (or HashSets) for List 2 ? This way you will only need to iterate over List 1

David Soroko 2010-01-04 21:16:08

That would work if the items in list 1 and list 2 were directly comparable. However we need to test each item in list 2 for every occurrence in list 1 as the items are not directly comparable and list 2 is the driver.

Scruffers 2010-01-04 21:30:54

Not comparable in the sense that List2 contains sets of primitives and List1 contains primitives? In that case convert List2 to a single HashSet (e.g. {X1, X2, Y1, Z1, Z3} ) and proceed by iterating over List1 and for each member do a quick contains() on the hashset. Creating the hashset is not free, but you can reason about the tradeoffs.

David Soroko 2010-01-05 06:47:44

David, I see your point and this could work. I should have clarified that X, Y, Zs are actually just sets of integers so it would be possible to have same value in X and Y set but this distinction would be lost in a single hash set. My apologies.

Scruffers 2010-01-05 10:43:39

Answer 3

+1 A:

If the total size of the Sets is not too large you could try to model List 2 as bitfields. The structure will be probably quite fragmented though - maybe the structures referenced in the Wikipedia article on Bit arrays (Judy arrays, tries, Bloom filter) can help address the memory problems of you normalization approach.

yawn 2010-01-04 21:29:38

Answer 4

+1 A:

You could build a tree out of List2; the first level of the tree is the first of (X1..Xn) that appears in set X. The second level is the values for the second item, plus a leaf node containing the set of lists which contain only X1. The next level contains the next possible value, and so on.

Root --+--X1--+--EOF--> List of pointers to list2 lines containing only "X1"
       |      |
       |      +--X2---+--EOF--> List of pointers to list2 lines containing only "X1,X2"
       |      |       |
       |      |       +--X3--+--etc--
       |      |       
       |      +--X3---+--EOF--> "X1,X3"
       |             
       +--X2--+--EOF--> "X2"
       |      |
       |      +--X3---+--EOF--> "X2,X3"
       |      |       |
       ...

This is expensive in memory consumption (N^2 log K, I think? where N=values for X, K=lines in List2) but results in fast retrieval times. If the number of possible Xs is large then this approach will break down...

Obviously you could build this index for all 3 parts of the tuple, and then AND together the results from searching each tree.

Alex Feinman 2010-01-04 21:30:05

Answer 5

A:

If you use Guava, there is a high-level way to do this that is not necessarily optimal but doesn't do anything crazy:

List<SomeType> list1 = ...;
List<Set<SomeType>> candidateFromList2 = ...;
if (Sets.cartesianProduct(candidateFromList2).contains(list1)) { ... }

But it's not that hard to check this "longhand" either.

Kevin Bourrillion 2010-01-05 01:42:30

Answer 6

+1 A:

There's a fairly efficient way to do this with a single pass over list2. You start by building an index of the items in list1.

from collections import defaultdict

# index is HashMap<X, HashMap<Y, HashMap<Z, Integer>>>
index = defaultdict(lambda: defaultdict(dict))
for rowid, (x, y, z) in enumerate(list1):
    index[x][y][z] = rowid

for rowid2, (xs, ys, zs) in enumerate(list2):
    xhits = defaultdict(list)
    for x in xs:
        if x in index:
            for y, zmap in index[x].iteritems():
                xhits[y].append(zmap)

    yhits = defaultdict(list)
    for y in ys:
        if y in xhits:
            for z, rowid1 in xhits[y].iteritems():
                yhits[z].append(rowid1)

    for z in zs:
        if z in yhits:
            for rowid1 in yhits[z]:
                print "list1[%d] matches list2[%d]" % (hit[z], rowid2)

The extra bookkeeping here will probably make it slower than indexing list2. But since in your case list1 is typically much smaller than list2, this will use much less memory. If you're reading list2 from disk, with this algorithm you never need to keep any part of it in memory.

Memory access can be a big deal, so I can't say for sure which will be faster in practice. Have to measure. The worst-case time complexity in both cases, barring hash table malfunctions, is O(len(list1)*len(list2)).

Jason Orendorff 2010-01-05 20:41:05

ansaurus

tags:

views:

answers:

Efficient Matching Algorithm for Set Based Triplets

related questions