This is the problem of finding the minimal elements of a collection of sets. The naive algorithm, which also serves as the problem definition, looks like this:
set(s for s in sets if not any(other < s for other in sets))
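For example, on a small hypothetical input (the phrases here are made up for illustration), the proper-subset test (other < s) eliminates any phrase that strictly contains another:

sets = [frozenset({"big", "red", "ball"}),
        frozenset({"red", "ball"}),
        frozenset({"blue", "ball"})]
minimal = set(s for s in sets if not any(other < s for other in sets))
# {"red", "ball"} is a proper subset of {"big", "red", "ball"},
# so only the two two-word phrases remain:
# {frozenset({'red', 'ball'}), frozenset({'blue', 'ball'})}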
There are sub-quadratic algorithms for this (such as this), but given that N is 10,000 the efficiency of the implementation probably matters more than asymptotic complexity. The best approach depends heavily on the distribution of the input data. Since the input sets are natural-language phrases that mostly differ from one another, the approach suggested by redtuna should work well: index the phrases by word, then for each candidate phrase intersect the index entries of its words to find all of its supersets. Here's a Python implementation of that algorithm.
from collections import defaultdict
from functools import reduce

def find_minimal_phrases(phrases):
    # Make the phrases hashable, and materialize the list so it can
    # be traversed more than once
    phrases = [frozenset(phrase) for phrase in phrases]
    # Build an inverted index mapping each word to the set of
    # phrases containing it
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)
    minimal_phrases = []
    found_superphrases = set()
    # Visit phrases in order of increasing length so minimal sets are
    # found first: a being a proper superset of b implies len(a) > len(b)
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            # The supersets of this phrase are exactly the phrases
            # containing every one of its words
            connected_phrases = [phrases_containing[word] for word in phrase]
            # Intersect starting from the smallest candidate set
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases
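A quick hypothetical usage example (phrases are lists of words; the result comes back in increasing-length order, so the shortest minimal phrases appear first):

phrases = [["big", "red", "ball"], ["red", "ball"], ["spoon"], ["red", "spoon"]]
for minimal in find_minimal_phrases(phrases):
    print(sorted(minimal))
# Prints ['spoon'] then ['ball', 'red']; the two longer phrases are
# supersets of these and are filtered out.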
This is still quadratic in the worst case, but on my machine it runs in 350ms for a set of 10k phrases of which 50% are minimal, with words drawn from an exponential distribution.
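For reference, here is a sketch of how comparable test data could be generated; the phrase lengths and the rate parameter are assumptions, not the original benchmark setup:

import random
import time

random.seed(0)

def random_phrase():
    # Hypothetical generator: word ids drawn from an exponential
    # distribution, so a few words are very common
    return [int(random.expovariate(1 / 500)) for _ in range(random.randint(3, 8))]

test_phrases = [random_phrase() for _ in range(10000)]
start = time.perf_counter()
result = find_minimal_phrases(test_phrases)
print(len(result), "minimal phrases found in",
      round(time.perf_counter() - start, 3), "seconds")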