views:

780

answers:

14

What is efficient way to determine if a list is a subset of another list?

Example:

is_subset(List(1,2,3,4),List(2,3))    //Returns true
is_subset(List(1,2,3,4),List(3,4,5))  //Returns false

I am mostly looking for efficient algorithm and not too concern how the list is stored. It can be stored in array, link list or other data structure.

Thanks

EDIT: The list is sorted

+1  A: 

If you're ok with storing the data in a hashset you can simply check whether list1 contains x for each x in list2. Which will be close to O(n) in the size of list2. (Of course you can also do the same with other datastructures, but that will lead to different runtimes).

sepp2k
It should be noted that this may not work if you are concerned about ordering or continuity.
Jason Baker
+1  A: 

This depends highly on the language/toolkit, as well as the size and storage of the lists.

If the lists are sorted, a single loop can determine this. You can just start walking the larger list trying to find the first element of the smaller list (break if you pass it in value), then move on to the next, and continue from the current location. This is fast, since it's a one loop/one pass algorithm.

For unsorted lists, it's often fastest to build some form of hash table from the first list's elements, then search each element in the second list off the hash. This is the approach that many of the .NET LINQ extensions use internally for item searching within a list, and scale quite well (although they have fairly large temporary memory requirements).

Reed Copsey
+7  A: 

If both lists are ordered, one simple solution would be to simultaneously go over both lists (with a two bump pointers in both lists), and verify that all of the elements in the second list appear in the first list (until all elements are found, or until you reach a larger number in the first list).

A pseudo-code in C++ would look something like this:

List l1, l2;
iterator i1 = l1.start();
iterator i2 = l2.start();
while(i1 != l1.end() && i2 != l2.end()) {
  if (*i1 == *i2) {
    i1++;
    i2++;
  } else if (*i1 > *i2) {
    return false;
  } else {
    i1++;
  }
}
return true;

(It obviously won't work as is, but the idea should be clear).

If the lists are not ordered, you can use a hashtable - insert all of your elements in the first list, and then check if all of the elements in the second list appear in the hashtable.

These are algorithmic answers. In different languages, there are default built-in methods to check this.

Anna
A: 

The efficient algorithm uses some kind of state machine, where you keep the accepting states in memory (in python):

def is_subset(l1, l2):
    matches = []
    for e in l1:
        # increment
        to_check = [0] + [i+1 for i in matches]
        matches = [] # nothing matches
        for i in to_check:
            if l2[i] = e:
                if i == len(l2)-1:
                    return True
                matches.append(i)
    return False

EDIT: of course if the list are sorted, you don't need that algorithm, just do:

def is_subset(l1, l2):
    index = 0
    for e in l1:
        if e > l2[index]:
            return False
        elif e == l2[index]:
            index += 1
        else:
            index == 0
        if index == len(l2):
            return True
    return False
tonfa
A: 

You should have a look at the implementation of STL method search. That is the C++ way I think this would be done.

http://www.sgi.com/tech/stl/search.html

Description:

Search finds a subsequence within the range [first1, last1) that is identical to [first2, last2) when compared element-by-element.

martsbradley
This wouldn't work in the following example: (1, 2, 3, 4, 5), (1, 2, 5).
mos
+6  A: 

Just wanted to mention that Python has a method for this:

return set(list2).issubset(list1)

Or:

return set(list2) <= set(list1)
Nimrod
+3  A: 

If you're concerned about ordering or continuity, you may need to use the Boyer-Moore or the Horspool algorithm.

The question is, do you want to consider [2, 1] to be a subset of [1, 2, 3]? Do you want [1, 3] to be considered a subset of [1, 2, 3]? If the answer is no to both of these, you might consider one of the algorithms linked above. Otherwise, you may want to consider a hash set.

Jason Baker
or else you can do some preprocessing (if it's worth it, i.e. you will use the larger list over and over) and make a suffix tree or suffix array.
San Jacinto
+15  A: 

For C++, the best way is to use std::includes algorithm:

#include <algorithm>

std::list<int> l1, l2;
...
// Test whether l2 is a subset of l1
bool is_subset = std::includes(l1.begin(), l1.end(), l2.begin(), l2.end());

This requires both lists to be sorted, as specified in your question. Complexity is linear.

Pavel Minaev
+1 for using the standard libraries
luke
A: 

If the lists are ordered you can do the following in PHP:

function is_subset($array, $search)
{
    if (strpos(serialize($array), serialize($search)) !== false)
    {
     return true;
    }

    return false;
}

EDIT: Seems like the $needle argument in in_array() can be an array, so the following also works for (un?)ordered lists:

in_array($search, $array);
Alix Axel
FFS, why all the down votes?! I really don't get it...
Alix Axel
My -1: this will work most of the time, but is liable to break if any of the items in $array looks like a serialised string -- then you can get false positives. The problem is that strpos() isn't smart enough to know where the true boundaries between elements are.
j_random_hacker
+1  A: 
func isSubset ( @list, @possibleSubsetList ) {
    if ( size ( @possibleSubsetList ) > size ( @list ) ) {
        return false;
    }
    for ( @list : $a ) {
        if ( $a != @possibleSubsetList[0] ) {
            next;
        } else {
            pop ( @possibleSubsetList );
        }
    }
    if ( size ( @possibleSubsetList ) == 0 ) {
        return true;
    } else {
        return false;
    }
}

O(n) viola. of course, isSubset( (1,2,3,4,5), (2,4) ) will return true

John
+2  A: 

Scala, assuming you mean subsequence by subset:

def is_subset[A,B](l1: List[A], l2: List[B]): Boolean =
  (l1 indexOfSeq l2) > 0

Anyway, a subsequence is just a substring problem. Optimal algorithms include Knuth-Morris-Pratt and Boyer-Moore, and a few more complex ones.

If you truly meant subset, though, and thus you are speaking of Sets and not Lists, you can just use the subsetOf method in Scala. Algorithms will depend on how the set is stored. The following algorithm works for a list storage, which is a very suboptimal one.

def is_subset[A,B](l1: List[A], l2: List[B]): Boolean = (l1, l2) match {
  case (_, Nil) => true
  case (Nil, _) => false
  case (h1 :: t1, h2 :: t2) if h1 == h2 => is_subset(t1, t2)
  case (_ :: tail, list) => is_subset(tail, list)
}
Daniel
+2  A: 

For indexOfSeq in scala trunk I implemented KMP, which you can examine: SequenceTemplate

extempore
+1 for implementing KMP. :-)
Daniel
+8  A: 

Here are a few trade offs you can make. Let's assume that you have two sets of elements, S and T, drawn from a universe U. We want to determine if S≥T. In one of the given examples, we have

S={1,2,3,4}
T={3,4,5}
U={1,2,3,4,5}

1. Sorted Lists (or balanced search tree)
The method suggested by most posters. If you already have sorted lists, or don't care about the length of time it takes to create them (say, you're not doing that often), then this algorithm is basically linear time and space. This is usually the best option.

(To be fair to other choices here, the time and space bounds should actually contain factors of "Log |U|" in appropriate places, but this is usually not relivant)

Data structures: Sorted list for each of S and T. Or a balanced search tree (e.g. AVL tree, red-black tree, B+-tree) that can be iterated over in constant space.

Algorithm: For each element in T, in order, search S linearly for that element. Remember where you left off each search, and start the next search there. If every search succeeds, then S≥T.

Time complexity: about O( |S| Log|S| + |T| Log|T| ) to create the sorted lists, O( max(|S|, |T|) ) to compare.

Space complexity: about O( |S| + |T| )

Example (C++)

#include <set>
#include <algorithm>

std::set<int> create_S()
{
    std::set<int> S;
    // note: std::set will put these in order internally
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::set<int> create_T()
{
    std::set<int> T;
    // note std::set will put these in order internally
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

int main()
{
    std::set<int> S=create_S();
    std::set<int> T=create_T();
    return std::includes(S.begin(),S.end(), T.begin(), T.end());
}

2. Hash tables
Better average time complexity than with a sorted list can be obtained using hash tables. The improved behavior for large sets comes at the cost of generally poorer performance for small sets.

As with sorted lists, I'm ignoring the complexity contributed by the size of the universe.

Data structure: Hash table for S, anything quickly iterable for T.

Algorithm: Insert each element of S into its hashtable. Then, for each element in T, check to see if it's in the hash table.

Time complexity: O( |S| + |T| ) to set up, O( |T| ) to compare.

Space complexity: O( |S| + |T| )

Example (C++)

#include <tr1/unordered_set>

std::tr1::unordered_set<int> create_S()
{
    std::tr1::unordered_set<int> S;
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::tr1::unordered_set<int> create_T()
{
    std::tr1::unordered_set<int> T;
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

bool includes(const std::tr1::unordered_set<int>& S, 
              const std::tr1::unordered_set<int>& T)
{
    for (std::tr1::unordered_set<int>::const_iterator iter=T.begin();
         iter!=T.end();
         ++iter)
    {
        if (S.find(*iter)==S.end())
        {
            return false;
        }
    }
    return true;
}

int main()
{
    std::tr1::unordered_set<int> S=create_S();
    std::tr1::unordered_set<int> T=create_T();
    return includes(S,T);
}

3. Bit sets
If your universe is particularly small (let's say you can only have elements 0-32), then a bitset is a reasonable solution. The running time (again, assuming you don't care about setup time) is essentially constant. In the case you do care about setup, it's still faster than creating a sorted list.

Unfortunately, bitsets become unwieldy very quickly for even a moderately sized universe.

Data structure: bit vector (usually a machine integer) for each of S and T. We might encode S=11110 and T=00111, in the given example.

Algorithm: Calculate the intersection, by computing the bitwise 'and' of each bit in S with the corresponding bit in T. If the result equals T, then S≥T.

Time complexity: O( |U| + |S| + |T| ) to setup, O( |U| ) to compare.

Space complexity: O( |U| )

Example: (C++)

#include <bitset>

// bitset universe always starts at 0, so create size 6 bitsets for demonstration.
// U={0,1,2,3,4,5}

std::bitset<6> create_S()
{
    std::bitset<6> S;
    // Note: bitsets don't care about order
    S.set(3);
    S.set(2);
    S.set(4);
    S.set(1);
    return S;
}

std::bitset<6> create_T()
{
    std::bitset<6> T;
    // Note: bitsets don't care about order
    T.set(4);
    T.set(3);
    T.set(5);
    return T;
}

int main()
{
    std::bitset<6> S=create_S();
    std::bitset<6> T=create_T();

    return S & T == T;
}

4. Bloom filters
All the speed benefits of bitsets, without the pesky limitation on universe size the bitsets have. Only one down side: they sometimes (often, if you're not careful) give the wrong answer: If the algorithm says "no", then you definitely don't have inclusion. If the algorithm says "yes", you might or might not. Better accuracy is attained by choosing a large filter size, and good hash functions.

Given that they can and will give wrong answers, Bloom filters might sound like a horrible idea. However, they have definite uses. Generally one would use Bloom filters to do many inclusion checks quickly, and then use a slower deterministic method to guarantee correctness when needed. The linked Wikipedia article mentions some applications using Bloom filters.

Data structure: A Bloom filter is a fancy bitset. Must choose a filter size, and hash functions beforehand.

Algorithm (sketch): Initialize the bitset to 0. To add an element to a bloom filter, hash it with each hash function, and set the corresponding bit in the bitset. Determining inclusion works just as for bitsets.

Time complexity: O( filter size )

Space complexity: O( filter size )

Probability of correctness: Always correct if it answers for "S does not include T". Something like 0.6185^(|S|x|T|/(filter size))) if it answers "S includes T". In particular, the filter size must be chosen proportional to the product of |S| and |T| to give reasonable probability of accuracy.

Managu
Comprehensive! +1.
j_random_hacker
A: 

You can see the problem to check if a list is a subset of another list as the same problem to verify if a substring belongs to a string. The best known algorithm for this is the KMP (Knuth-Morris-Pratt). Look at wikipedia for a pseudo-code or just use some String.contains method available in the language of your preference. =)

darlinton