Given a set of phrases, I would like to filter out every phrase that contains any of the other phrases. "Contains" here means that if a phrase contains all the words of another phrase, it should be filtered out. The order of the words within a phrase does not matter.

What I have so far is this:

  1. Sort the set by the number of words in each phrase (shortest first).
  2. For each phrase X in the set:
    1. For each phrase Y in the rest of the set:
      1. If all the words in X are also in Y, then Y contains X; discard Y.

This is slow given a list of about 10k phrases. Any better options?
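The steps above can be sketched in Python like this (a minimal sketch; the function name is mine, and I keep the minimal phrases rather than discarding supersets, which yields the same result):

```python
def filter_superphrases(phrases):
    # Sort word sets shortest-first, then keep a phrase only if no
    # already-kept (necessarily shorter or equal) phrase is contained in it.
    sets = sorted((frozenset(p.split()) for p in phrases), key=len)
    kept = []
    for s in sets:
        if not any(k < s for k in kept):
            kept.append(s)
    return kept
```

This is the quadratic behavior in question: each phrase is compared against all kept phrases.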

A: 

Sort the words within each phrase, e.g. 'Z A' -> 'A Z'; then eliminating phrases is easy, going from the shortest phrases to the longer ones.
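With the words sorted, containment between two phrases can be checked with a merge-style scan instead of repeated lookups. A sketch of that check (assuming whitespace-separated words; the function name is mine):

```python
def contains(long_words, short_words):
    """True if the sorted word list long_words contains every word of the
    sorted word list short_words. Runs in O(len(long) + len(short))."""
    i = 0
    for w in long_words:
        if i < len(short_words) and w == short_words[i]:
            i += 1
    return i == len(short_words)
```

Usage: `contains(sorted("big red dog".split()), sorted("dog red".split()))` is true regardless of the original word order.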

Victor Sorokin
+1  A: 

You could build an index which maps words to phrases and do something like:

let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch

After this, matched would contain all phrases matching all words in the target phrase. It could be pretty well optimized by initializing matched to the set of all phrases containing only words[0], and then only iterating over words[1..words.length]. Filtering phrases which are too short to match the target phrase may improve performance, too.

Unless I'm mistaken, a simple implementation has a worst case complexity (when the search phrase matches all phrases) of O(n·m), where n is the number of words in the search phrase, and m is the number of phrases.
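A Python sketch of that index-and-intersect lookup (names and the use of phrase indices as identifiers are my own choices):

```python
from collections import defaultdict

def build_index(phrases):
    # Map each word to the set of ids of phrases containing it
    index = defaultdict(set)
    for i, phrase in enumerate(phrases):
        for word in set(phrase.split()):
            index[word].add(i)
    return index

def phrases_containing_all(index, target):
    # Start from the first word's posting set, then intersect the rest
    words = target.split()
    matched = set(index.get(words[0], set()))
    for word in words[1:]:
        matched &= index.get(word, set())
    return matched
```

After `phrases_containing_all` returns, the result holds the ids of every phrase containing all the words of the target.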

gustafc
+1  A: 

Your algorithm is quadratic in the number of phrases; that's probably what's slowing it down. Here I index phrases by their words to get below quadratic in the common case.

# build the index
foreach phrase: foreach word: phrases[word] += phrase

# use the index to filter out phrases that contain all the words
# of another phrase
foreach phrase:
  foreach word:
     if first word:
        siblings = phrases[word]
     else:
        siblings = siblings intersection phrases[word]
  # siblings now contains every phrase that has at least all our words
  remove each sibling (other than the phrase itself) from the output set of phrases

# done!
redtuna
+1  A: 

This is the problem of finding minimal values of a set of sets. The naive algorithm and problem definition looks like this:

set(s for s in sets if not any(other < s for other in sets))

There are sub-quadratic algorithms to do this (such as this), but given that N is 10000 the efficiency of the implementation probably matters more. The optimal approach depends heavily on the distribution of the input data. Given that the input sets are natural language phrases that mostly differ, the approach suggested by redtuna should work well. Here's a python implementation of that algorithm.

from collections import defaultdict
from functools import reduce

def find_minimal_phrases(phrases):
    # Make the phrases hashable (a list, so we can iterate more than once)
    phrases = [frozenset(phrase) for phrase in phrases]

    # Create a map from each word to all phrases containing it
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)

    minimal_phrases = []
    found_superphrases = set()
    # Iterate in order of increasing length so minimal sets are found
    # first: a proper superset is always strictly longer than its subset.
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            connected_phrases = [phrases_containing[word] for word in phrase]
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases

This is still quadratic, but on my machine it runs in 350ms for a set of 10k phrases containing 50% of minimal values with words from an exponential distribution.

Ants Aasma