Given a set of phrases, I would like to filter out every phrase that contains any of the other phrases. "Contains" here means that if a phrase contains all the words of another phrase, it should be filtered out. The order of the words within a phrase does not matter.

What I have so far is this:

  1. Sort the set by the number of words in each phrase (shortest first).
  2. For each phrase X in the set:
    1. For each phrase Y in the rest of the set:
      1. If all the words in X are also in Y, then Y contains X; discard Y.

This is slow given a list of about 10k phrases. Any better options?
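The steps above can be sketched in Python like this (a minimal sketch; the function name is mine, and I keep the minimal phrases rather than discarding supersets, which yields the same result):

```python
def filter_superphrases(phrases):
    # Sort word sets shortest-first, then keep a phrase only if no
    # already-kept (necessarily shorter or equal) phrase is contained in it.
    sets = sorted((frozenset(p.split()) for p in phrases), key=len)
    kept = []
    for s in sets:
        if not any(k < s for k in kept):
            kept.append(s)
    return kept
```

This is the quadratic behavior in question: each phrase is compared against all kept phrases.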

A: 

Sort the words within each phrase, e.g. 'Z A' -> 'A Z'; then eliminating phrases is easy, going from the shortest phrases to the longer ones.
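With the words sorted, containment between two phrases can be checked with a merge-style scan instead of repeated lookups. A sketch of that check (assuming whitespace-separated words; the function name is mine):

```python
def contains(long_words, short_words):
    """True if the sorted word list long_words contains every word of the
    sorted word list short_words. Runs in O(len(long) + len(short))."""
    i = 0
    for w in long_words:
        if i < len(short_words) and w == short_words[i]:
            i += 1
    return i == len(short_words)
```

Usage: `contains(sorted("big red dog".split()), sorted("dog red".split()))` is true regardless of the original word order.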

Victor Sorokin
+1  A: 

You could build an index which maps words to phrases and do something like:

let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch

After this, matched would contain all phrases matching all words in the target phrase. It could be pretty well optimized by initializing matched to the set of all phrases containing only words[0], and then only iterating over words[1..words.length]. Filtering phrases which are too short to match the target phrase may improve performance, too.

Unless I'm mistaken, a simple implementation has a worst case complexity (when the search phrase matches all phrases) of O(n·m), where n is the number of words in the search phrase, and m is the number of phrases.
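A Python sketch of that index-and-intersect lookup (names and the use of phrase indices as identifiers are my own choices):

```python
from collections import defaultdict

def build_index(phrases):
    # Map each word to the set of ids of phrases containing it
    index = defaultdict(set)
    for i, phrase in enumerate(phrases):
        for word in set(phrase.split()):
            index[word].add(i)
    return index

def phrases_containing_all(index, target):
    # Start from the first word's posting set, then intersect the rest
    words = target.split()
    matched = set(index.get(words[0], set()))
    for word in words[1:]:
        matched &= index.get(word, set())
    return matched
```

After `phrases_containing_all` returns, the result holds the ids of every phrase containing all the words of the target.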

gustafc
+1  A: 

Your algorithm is quadratic in the number of phrases; that's probably what's slowing it down. Here I index phrases by their words to get below quadratic in the common case.

# build the index
foreach phrase: foreach word: phrases[word] += phrase

# use the index to filter out phrases that contain all the words
# of another phrase
foreach phrase:
  foreach word:
     if first word:
        siblings = phrases[word]
     else:
        siblings = siblings intersection phrases[word]
  # siblings now contains every phrase that has at least all our words
  remove each sibling (other than the phrase itself) from the output set of phrases

# done!
redtuna
+1  A: 

This is the problem of finding minimal values of a set of sets. The naive algorithm and problem definition looks like this:

set(s for s in sets if not any(other < s for other in sets))

There are sub-quadratic algorithms to do this (such as this), but given that N is 10000 the efficiency of the implementation probably matters more. The optimal approach depends heavily on the distribution of the input data. Given that the input sets are natural language phrases that mostly differ, the approach suggested by redtuna should work well. Here's a python implementation of that algorithm.

from collections import defaultdict
from functools import reduce

def find_minimal_phrases(phrases):
    # Make the phrases hashable (a list, so we can iterate more than once)
    phrases = [frozenset(phrase) for phrase in phrases]

    # Create a map from each word to all phrases containing it
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)

    minimal_phrases = []
    found_superphrases = set()
    # Iterate in order of increasing length so minimal sets are found
    # first: a proper superset is always strictly longer than its subset.
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            connected_phrases = [phrases_containing[word] for word in phrase]
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases

This is still quadratic, but on my machine it runs in 350ms for a set of 10k phrases containing 50% of minimal values with words from an exponential distribution.

Ants Aasma