Here are a few trade-offs you can make. Let's assume that you have two sets of elements, S and T, drawn from a universe U. We want to determine whether S≥T, that is, whether S is a superset of T. In one of the given examples, we have
S={1,2,3,4}
T={3,4,5}
U={1,2,3,4,5}
1. Sorted Lists (or balanced search tree)
The method suggested by most posters. If you already have sorted lists, or don't care how long it takes to create them (say, because you won't do it often), then this algorithm is basically linear in time and space. This is usually the best option.
(To be fair to the other choices here, the time and space bounds should actually contain factors of log |U| in appropriate places, but this is usually not relevant.)
Data structures: Sorted list for each of S and T. Or a balanced search tree (e.g. AVL tree, red-black tree, B+-tree) that can be iterated over in constant space.
Algorithm: For each element in T, in order, search S linearly for that element. Remember where each search left off, and start the next search from there. If every search succeeds, then S≥T.
Time complexity: about O( |S| log|S| + |T| log|T| ) to create the sorted lists, O( max(|S|, |T|) ) to compare.
Space complexity: about O( |S| + |T| )
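The remembered-position searching described above is really a single merge-style pass over the two sorted sequences. Here is a minimal hand-rolled sketch over sorted vectors (the name sorted_includes is my own; std::includes, used in the example below, does the same job for you):

```cpp
#include <vector>
#include <cstddef>

// Returns true if every element of sorted_T also appears in sorted_S.
// Precondition: both vectors are sorted in ascending order.
bool sorted_includes(const std::vector<int>& sorted_S,
                     const std::vector<int>& sorted_T)
{
    std::size_t i = 0; // position in sorted_S; never reset between searches
    for (std::size_t j = 0; j < sorted_T.size(); ++j)
    {
        // advance through S until we reach or pass the current element of T
        while (i < sorted_S.size() && sorted_S[i] < sorted_T[j])
        {
            ++i;
        }
        if (i == sorted_S.size() || sorted_S[i] != sorted_T[j])
        {
            return false; // ran out of S, or the element is missing
        }
    }
    return true;
}
```

Each position in S and T is visited at most once, which is where the O( max(|S|, |T|) ) comparison bound comes from.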
Example (C++)

#include <set>
#include <algorithm>

std::set<int> create_S()
{
    std::set<int> S;
    // note: std::set keeps its elements in sorted order internally
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::set<int> create_T()
{
    std::set<int> T;
    // note: std::set keeps its elements in sorted order internally
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

int main()
{
    std::set<int> S = create_S();
    std::set<int> T = create_T();
    // std::includes performs the merge-style comparison described above
    return std::includes(S.begin(), S.end(), T.begin(), T.end());
}
2. Hash tables
Better average time complexity than with a sorted list can be obtained using hash tables. The improved behavior for large sets comes at the cost of generally poorer performance for small sets.
As with sorted lists, I'm ignoring the complexity contributed by the size of the universe.
Data structure: Hash table for S, anything quickly iterable for T.
Algorithm: Insert each element of S into its hash table. Then, for each element of T, check whether it is in the hash table.
Time complexity: O( |S| + |T| ) to set up, O( |T| ) to compare.
Space complexity: O( |S| + |T| )
Example (C++)

#include <unordered_set>

std::unordered_set<int> create_S()
{
    std::unordered_set<int> S;
    // insertion order doesn't matter; lookup is O(1) on average
    S.insert(3);
    S.insert(2);
    S.insert(4);
    S.insert(1);
    return S;
}

std::unordered_set<int> create_T()
{
    std::unordered_set<int> T;
    T.insert(4);
    T.insert(3);
    T.insert(5);
    return T;
}

bool includes(const std::unordered_set<int>& S,
              const std::unordered_set<int>& T)
{
    // every element of T must be found in S's hash table
    for (std::unordered_set<int>::const_iterator iter = T.begin();
         iter != T.end();
         ++iter)
    {
        if (S.find(*iter) == S.end())
        {
            return false;
        }
    }
    return true;
}

int main()
{
    std::unordered_set<int> S = create_S();
    std::unordered_set<int> T = create_T();
    return includes(S, T);
}
3. Bit sets
If your universe is particularly small (say, elements are restricted to 0-31, so each set fits in a 32-bit machine word), then a bitset is a reasonable solution. The running time (again, assuming you don't care about setup time) is essentially constant. If you do care about setup time, it's still faster than creating a sorted list.
Unfortunately, bitsets become unwieldy very quickly for even a moderately sized universe.
Data structure: bit vector (usually a machine integer) for each of S and T. We might encode S=11110 and T=00111, in the given example.
Algorithm: Calculate the intersection, by computing the bitwise 'and' of each bit in S with the corresponding bit in T. If the result equals T, then S≥T.
Time complexity: O( |U| + |S| + |T| ) to setup, O( |U| ) to compare.
Space complexity: O( |U| )
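When the whole universe fits in one machine word, the data structure and the comparison collapse to a couple of instructions. A sketch using a 64-bit word, with bit k representing membership of element k (word_includes is my own name for the check):

```cpp
#include <cstdint>

// Bit k of a mask records whether element k (of universe {0,...,63}) is present.
bool word_includes(std::uint64_t S, std::uint64_t T)
{
    // S contains T exactly when intersecting T with S leaves T unchanged
    return (S & T) == T;
}
```

For the running example, S={1,2,3,4} becomes the mask (1<<1)|(1<<2)|(1<<3)|(1<<4), and T={3,4,5} becomes (1<<3)|(1<<4)|(1<<5).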
Example: (C++)

#include <bitset>

// A bitset universe always starts at 0, so use size-6 bitsets for demonstration.
// U={0,1,2,3,4,5}
std::bitset<6> create_S()
{
    std::bitset<6> S;
    // note: bitsets don't care about insertion order
    S.set(3);
    S.set(2);
    S.set(4);
    S.set(1);
    return S;
}

std::bitset<6> create_T()
{
    std::bitset<6> T;
    // note: bitsets don't care about insertion order
    T.set(4);
    T.set(3);
    T.set(5);
    return T;
}

int main()
{
    std::bitset<6> S = create_S();
    std::bitset<6> T = create_T();
    // parentheses are required: == binds more tightly than &
    return (S & T) == T;
}
4. Bloom filters
All the speed benefits of bitsets, without the pesky limitation on universe size that bitsets have. There is only one downside: they sometimes (often, if you're not careful) give the wrong answer. If the algorithm says "no", then you definitely don't have inclusion. If the algorithm says "yes", you might or might not. Better accuracy is attained by choosing a larger filter size and good hash functions.
Given that they can and will give wrong answers, Bloom filters might sound like a horrible idea. However, they have definite uses. Generally one would use Bloom filters to do many inclusion checks quickly, and then use a slower deterministic method to guarantee correctness when needed. The linked Wikipedia article mentions some applications using Bloom filters.
Data structure: A Bloom filter is a fancy bitset. You must choose the filter size and the hash functions beforehand.
Algorithm (sketch): Initialize the bitset to 0. To add an element to a Bloom filter, hash it with each hash function and set the corresponding bits in the bitset. Determining inclusion works just as for bitsets.
Time complexity: O( filter size )
Space complexity: O( filter size )
Probability of correctness: Always correct if it answers "S does not include T". If it answers "S includes T", the chance that the answer is a false positive is roughly 0.6185^(filter size/|S|) for each element of T that is actually missing from S (assuming a near-optimal number of hash functions). In particular, the filter size must be chosen proportional to |S| to give a reasonable probability of accuracy.
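A minimal sketch of the scheme above. The filter size (1024 bits), the number of hash functions (4), and the two multiplicative mixers used to derive them via double hashing are all arbitrary illustrative choices, not tuned values:

```cpp
#include <bitset>
#include <cstddef>

const std::size_t FILTER_BITS = 1024; // filter size, chosen arbitrarily
const std::size_t NUM_HASHES = 4;     // number of hash functions

typedef std::bitset<FILTER_BITS> BloomFilter;

// Two cheap integer mixers standing in for independent hash functions.
std::size_t hash1(int x) { return static_cast<std::size_t>(x) * 2654435761u; }
std::size_t hash2(int x) { return static_cast<std::size_t>(x) * 40503u + 1; }

void bloom_add(BloomFilter& filter, int x)
{
    // double hashing: derive NUM_HASHES bit indices from two base hashes
    for (std::size_t i = 0; i < NUM_HASHES; ++i)
    {
        filter.set((hash1(x) + i * hash2(x)) % FILTER_BITS);
    }
}

// Just as with plain bitsets: S "includes" T when S's bits cover T's bits.
// A "false" answer is definitive; a "true" answer may be a false positive.
bool bloom_includes(const BloomFilter& S, const BloomFilter& T)
{
    return (S & T) == T;
}
```

Build one filter from the elements of S and one from the elements of T, then call bloom_includes; an element contained in both sets always sets the same bits in both filters, which is why a "no" answer can be trusted.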