views:

99

answers:

4

Given a number of lists of items, find the lists with matching items.

The brute force pseudo-code for this problem looks like:

foreach list L
    foreach item I in list L
        foreach list L2 such that L2 != L  
            for each item I2 in L2
                if I == I2
                    return new 3-tuple(L, L2, I) //not important for the algorithm

I can think of a number of different ways of going about this - creating a list of lists and removing each candidate list after searching the others for example - but I'm wondering if there is a better algorithm for this?

I'm using Java, if that makes a difference to your implementation.

Thanks

A: 

You can use a trie, modified to record what lists each node belongs to.

Mau
can you elaborate on how this would work, and what the time complexity would be? I don't see how a trie would apply here.
Peter Recore
+1  A: 

As per your comment you want a MultiMap implementation. A multimap is like a Map but it can map each key to multiple values. Store the value and a reference to all the maps that contain that value.

Map<Object, List>

of course you should use a type safe instead of Object and a type safe List as the value. What you are trying to do is called an Inverted Index.

fuzzy lollipop
ahh, neat solution, but not quite what i was looking for - i want any matching pairs, rather than items that appear in every list. in other words, if item 1 appears in lists A and B and C, i would have three matches - A,B - B,C and A,C
MalcomTucker
then you need to edit your question with those details, they are not clear in your question
fuzzy lollipop
good stuff, thanks
MalcomTucker
I think the pseudo code accurately represents what Malcom is looking for.
Peter Recore
+3  A: 
  1. Create a Map<Item,List<List>>.
  2. Iterate through every item in every list.
  3. each time you touch an item, add the current list to that item's entry in the Map.

You now have a Map entry for each item that tells you what lists that item appears in.

This algorithm is about O(N) where N is the number of lists (the exact complexity will be affected by how good your Map implementation is). I believe your algorithm was at least O(N^2).

Caveat: I am comparing number of comparisons, not memory use. If your lists are super huge and full of mostly non duplicated items, the map that my method creates might become too big.

Peter Recore
oh good. you are concerned mostly with time, not space. my caveat is not important then.
Peter Recore
nice, thanks, that looks good.
MalcomTucker
Tomer Vromen
@Tomer Vromen - I think that depends on the type of map. I did say "about" O(N), not exactly. I will make it more clear that there is a bit of a fudge factor there.
Peter Recore
@Tomer Vromen - here's what wikipedia has to say about adding an item to a hash: "In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table."
Peter Recore
This answer is O-tilde(N), which means O(N) up to logarithmic factors. This is a good answer and it is equivalent to the one that I thought of: Sort pairs (item,listname) on the first key and then use consecutive stretches to see the list intersections.
Greg Kuperberg
On the other hand, the "map of lists" structure is one that I use all the time in web programming in Python; I made my own class for it called a "collation".
Greg Kuperberg
+1  A: 

I'll start with the assumption that the datasets can fit in memory. If not, then you will need something fancier.

I refer below to a "set", where I am thinking of something like a C++ std::set. I don't know the Java equivalent, but any storage scheme that permits rapid lookup (tree, hash table, whatever).

Comparing three lists: L0, L1 and L2.

  1. Read L0, placing each element in a set: S0.
  2. Read L1, placing items that match an element of S0 into a new set: S1, and discarding others.
  3. Discard S0.
  4. Read L2, keeping items that match an element of S1 and discarding others.

Update Just realised that the question was for "n" lists, not three. However the extension should be obvious. (I hope)

Update 2 Some untested C++ code to illustrate the algorithm

#include <string>
#include <vector>
#include <set>
#include <cassert>

typedef std::vector<std::string> strlist_t;

strlist_t GetMatches(std::vector<strlist_t> vLists)
{
    assert(vLists.size() > 1);
    std::set<std::string> s0, s1;
    std::set<std::string> *pOld = &s1;
    std::set<std::string> *pNew = &s0;

    // unconditionally load first list as "new"
    s0.insert(vLists[0].begin(), vLists[0].end());

    for (size_t i=1; i<vLists.size(); ++i)
    {
        //swap recently read "new" to "old" now for comparison with new list
        std::swap(pOld, pNew);
        pNew->clear();

        // only keep new elements if they are matched in old list
        for (size_t j=0; j<vLists[i].size(); ++j)
        {
            if (pOld->end() != pOld->find(vLists[i][j]))
            {
                // found match
                pNew->insert(vLists[i][j]);
            }
        }
    }
    return strlist_t(pNew->begin(), pNew->end());
}
Michael J