ansaurus

Question

An efficient way to find matching items in N lists?

Answer 1

A:

You can use a trie, modified to record what lists each node belongs to.

Mau 2010-07-09 15:52:11
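The answer above gives no details, so the following is only one possible reading of it: a minimal sketch that assumes the items are strings and that each list is identified by an integer index. The names TrieNode, trieInsert and listsContaining are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical trie node: children keyed by character, plus the ids of the
// lists that contain the word ending at this node.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::set<std::size_t> listIds;  // stays empty unless a word ends here
};

// Insert `word` and record that it came from list number `listId`.
void trieInsert(TrieNode& root, const std::string& word, std::size_t listId) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->listIds.insert(listId);
}

// Return the ids of the lists containing `word` (empty set if none).
std::set<std::size_t> listsContaining(const TrieNode& root,
                                      const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return std::set<std::size_t>();
        node = it->second.get();
    }
    return node->listIds;
}
```

After inserting every item of every list, any word whose node records more than one list id is a match. Building the trie costs time proportional to the total number of characters inserted, with a logarithmic factor per step from std::map and std::set.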

Can you elaborate on how this would work, and what the time complexity would be? I don't see how a trie would apply here.

Peter Recore 2010-07-09 16:10:50

Answer 2

+1 A:

As per your comment you want a MultiMap implementation. A multimap is like a Map, but it can map each key to multiple values. Store each value as the key, together with a reference to every list that contains that value.

Map<Object, List>

Of course, you should use a type-safe key type instead of Object and a type-safe List as the value. What you are trying to build is called an inverted index.

fuzzy lollipop 2010-07-09 16:02:00
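For readers working in C++, the multimap idea in this answer could be sketched with std::multimap. This assumes string items and lists identified by their position in a vector; InvertedIndex, buildInvertedIndex and listsFor are hypothetical names, not part of the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Inverted index: each item maps to the index of every list containing it.
typedef std::multimap<std::string, std::size_t> InvertedIndex;

InvertedIndex buildInvertedIndex(
        const std::vector<std::vector<std::string>>& lists) {
    InvertedIndex index;
    for (std::size_t i = 0; i < lists.size(); ++i) {
        for (const std::string& item : lists[i]) {
            // An item repeated within one list is recorded once per occurrence.
            index.insert(std::make_pair(item, i));
        }
    }
    return index;
}

// Collect the list indices recorded for one item.
std::vector<std::size_t> listsFor(const InvertedIndex& index,
                                  const std::string& item) {
    std::vector<std::size_t> result;
    auto range = index.equal_range(item);
    for (auto it = range.first; it != range.second; ++it) {
        result.push_back(it->second);
    }
    return result;
}
```

std::multimap is tree-based, so each insert and lookup is logarithmic in the number of stored entries; a hash-based container such as std::unordered_multimap would trade ordering for expected constant-time operations.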

Ahh, neat solution, but not quite what I was looking for. I want any matching pairs, rather than items that appear in every list. In other words, if item 1 appears in lists A, B and C, I would have three matches: (A,B), (B,C) and (A,C).

MalcomTucker 2010-07-09 16:05:35
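The pairing described in this comment can be illustrated with a short sketch. It assumes you already know the names of the lists that contain a given item; allPairs is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Given the names of the lists that contain one item, return every
// unordered pair of those lists.
std::vector<std::pair<std::string, std::string>>
allPairs(const std::vector<std::string>& listNames) {
    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::size_t i = 0; i < listNames.size(); ++i) {
        for (std::size_t j = i + 1; j < listNames.size(); ++j) {
            pairs.push_back(std::make_pair(listNames[i], listNames[j]));
        }
    }
    return pairs;
}
```

An item that appears in k lists yields k(k-1)/2 pairs, so lists A, B and C give the three matches mentioned in the comment.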

Then you need to edit your question to include those details; they are not clear as written.

fuzzy lollipop 2010-07-09 16:09:07

good stuff, thanks

MalcomTucker 2010-07-09 16:17:24

I think the pseudo code accurately represents what Malcom is looking for.

Peter Recore 2010-07-09 17:03:55

Answer 3

+3 A:

Create a Map<Item,List<List>>.
Iterate through every item in every list.
Each time you touch an item, add the current list to that item's entry in the Map.

You now have a Map entry for each item that tells you what lists that item appears in.
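The steps above can be sketched in C++ as follows. This is an illustration rather than the author's code: it assumes Item is a string, identifies each list by its index, and uses the hypothetical names Items, ItemToLists and mapItemsToLists.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

typedef std::vector<std::string> Items;
typedef std::unordered_map<std::string, std::vector<std::size_t>> ItemToLists;

// Visit every item in every list once, recording which lists it appears in.
ItemToLists mapItemsToLists(const std::vector<Items>& lists) {
    ItemToLists seenIn;
    for (std::size_t i = 0; i < lists.size(); ++i) {
        for (const std::string& item : lists[i]) {
            std::vector<std::size_t>& entry = seenIn[item];
            // Lists are processed in order, so a repeat of this item within
            // list i shows up as i already being the last recorded index.
            if (entry.empty() || entry.back() != i) entry.push_back(i);
        }
    }
    return seenIn;
}
```

Any entry whose vector holds more than one index is an item shared between lists.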

This algorithm is about O(N), where N is the total number of items across all lists, because each item is touched once (the exact complexity will be affected by how good your Map implementation is). I believe your algorithm was at least O(N^2).

Caveat: I am comparing number of comparisons, not memory use. If your lists are super huge and full of mostly non duplicated items, the map that my method creates might become too big.

Peter Recore 2010-07-09 16:05:57

oh good. you are concerned mostly with time, not space. my caveat is not important then.

Peter Recore 2010-07-09 16:07:04

nice, thanks, that looks good.

MalcomTucker 2010-07-09 16:07:22

Tomer Vromen 2010-07-09 16:14:22

@Tomer Vromen - I think that depends on the type of map. I did say "about" O(N), not exactly. I will make it more clear that there is a bit of a fudge factor there.

Peter Recore 2010-07-09 16:27:41

@Tomer Vromen - here's what Wikipedia has to say about adding an item to a hash table: "In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table."

Peter Recore 2010-07-09 16:33:33

This answer is O-tilde(N), which means O(N) up to logarithmic factors. This is a good answer and it is equivalent to the one that I thought of: Sort pairs (item,listname) on the first key and then use consecutive stretches to see the list intersections.

Greg Kuperberg 2010-07-09 16:38:35
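The sorting approach mentioned in this comment could look roughly like the sketch below. It assumes items and list names are both strings; ItemAndList and sharedItems are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> ItemAndList;  // (item, list name)

// Sort the pairs by item, then walk each run of equal items. A run that
// spans more than one list name is an item shared between those lists.
std::map<std::string, std::vector<std::string>>
sharedItems(std::vector<ItemAndList> pairs) {
    std::sort(pairs.begin(), pairs.end());
    // Drop exact duplicates (the same item repeated within one list).
    pairs.erase(std::unique(pairs.begin(), pairs.end()), pairs.end());

    std::map<std::string, std::vector<std::string>> shared;
    std::size_t start = 0;
    while (start < pairs.size()) {
        std::size_t end = start;
        while (end < pairs.size() && pairs[end].first == pairs[start].first) {
            ++end;
        }
        if (end - start > 1) {
            for (std::size_t k = start; k < end; ++k) {
                shared[pairs[start].first].push_back(pairs[k].second);
            }
        }
        start = end;
    }
    return shared;
}
```

The sort dominates the cost at O(N log N) comparisons for N pairs, which is the logarithmic factor the comment refers to.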

On the other hand, the "map of lists" structure is one that I use all the time in web programming in Python; I made my own class for it called a "collation".

Greg Kuperberg 2010-07-09 16:40:34

Answer 4

+1 A:

I'll start with the assumption that the datasets can fit in memory. If not, then you will need something fancier.

I refer below to a "set", where I am thinking of something like a C++ std::set. I don't know the Java equivalent, but any storage scheme that permits rapid lookup (tree, hash table, whatever) will do.

Comparing three lists: L0, L1 and L2.

Read L0, placing each element in a set: S0.
Read L1, placing items that match an element of S0 into a new set: S1, and discarding others.
Discard S0.
Read L2, keeping items that match an element of S1 and discarding others.

Update: Just realised that the question was for "n" lists, not three. However, the extension should be obvious. (I hope)

Update 2: Some untested C++ code to illustrate the algorithm

#include <string>
#include <vector>
#include <set>
#include <cassert>
#include <utility>   // std::swap

typedef std::vector<std::string> strlist_t;

// Returns the items that appear in every list in vLists.
strlist_t GetMatches(const std::vector<strlist_t>& vLists)
{
    assert(vLists.size() > 1);
    std::set<std::string> s0, s1;
    std::set<std::string> *pOld = &s1;
    std::set<std::string> *pNew = &s0;

    // unconditionally load first list as "new"
    s0.insert(vLists[0].begin(), vLists[0].end());

    for (size_t i=1; i<vLists.size(); ++i)
    {
        //swap recently read "new" to "old" now for comparison with new list
        std::swap(pOld, pNew);
        pNew->clear();

        // only keep new elements if they are matched in old list
        for (size_t j=0; j<vLists[i].size(); ++j)
        {
            if (pOld->end() != pOld->find(vLists[i][j]))
            {
                // found match
                pNew->insert(vLists[i][j]);
            }
        }
    }
    return strlist_t(pNew->begin(), pNew->end());
}

Michael J 2010-07-09 16:17:06
