views:

121

answers:

2

After finding the difflib.SequenceMatcher class in Python's standard library to be unsuitable for my needs, a generic "diff"-ing module was written to solve a problem space. After having several months to think more about what it is doing, the recursive algorithm appears to be searching more than in needs to by re-searching the same areas in a sequence that a separate "search thread" may have also examined.

The purpose of the diff module is to compute the difference and similarities between a pair of sequences (list, tuple, string, bytes, bytearray, et cetera). The initial version was much slower than the code's current form, having seen a speed increase by a factor of ten. How can memoization be applied to the following code? What is the best way to rewrite the algorithm to further increase any possible speed?


class Slice:

    __slots__ = 'prefix', 'root', 'suffix'

    def __init__(self, prefix, root, suffix):
        self.prefix = prefix
        self.root = root
        self.suffix = suffix

################################################################################

class Match:

    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value'

    def __init__(self, a, b, prefix, suffix, value):
        self.a = a
        self.b = b
        self.prefix = prefix
        self.suffix = suffix
        self.value = value

################################################################################

class Tree:

    __slots__ = 'nodes', 'index', 'value'

    def __init__(self, nodes, index, value):
        self.nodes = nodes
        self.index = index
        self.value = value

################################################################################

def search(a, b):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = a[a_addr:a_term]
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = b[b_addr:b_term]
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    a_pref, b_pref = a[:a_addr], b[:b_addr]
                    p_tree = search(a_pref, b_pref)
                    # Create suffix tree to search.
                    a_suff, b_suff = a[a_term:], b[b_term:]
                    s_tree = search(a_suff, b_suff)
                    # Make completed slice objects.
                    a_slic = Slice(a_pref, a_root, a_suff)
                    b_slic = Slice(b_pref, b_root, b_suff)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slic, b_slic, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
        # Return largest matches found.
        if nodes:
            return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)

Reference: How to optimize a recursive algorithm to not repeat itself?

+1  A: 

You could use the memoize decorator from the Python Decorator Library and use it like this:

@memoized
def search(a, b):

The first time you call search with arguments a,b, the result is calculated and memoized (saved in a cache). The second time search is called with the same arguments, the result in returned from the cache.

Note that for the memoized decorator to work, the arguments must be hashable. If a and b are tuples of numbers, then they are hashable. If they are lists then you could convert them to tuples before passing them to search. It doesn't look like search takes dicts as arguments, but if they were, then they would not be hashable and the memoization decorator would not be able to save the result in the cache.

unutbu
Your answer may not be the best, because his algorithm is using list slicing... The algorithm itself may need to refactored.
kzh
@kzh: Could you elaborate? I'm sorry -- I don't follow you. What goes wrong if there is list slicing?
unutbu
He would need to change the dictionaries back to lists for the slicing part. I posted my answer for example.
kzh
+1  A: 

As ~unutbu said, try the memoized decorator and the following changes:

@memoized
def search(a, b):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = list(a)[a_addr:a_term] #change to list
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = list(b)[b_addr:b_term] #change to list
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    a_pref, b_pref = list(a)[:a_addr], list(b)[:b_addr]
                    p_tree = search(a_pref, b_pref)
                    # Create suffix tree to search.
                    a_suff, b_suff = list(a)[a_term:], list(b)[b_term:]
                    s_tree = search(a_suff, b_suff)
                    # Make completed slice objects.
                    a_slic = Slice(a_pref, a_root, a_suff)
                    b_slic = Slice(b_pref, b_root, b_suff)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slic, b_slic, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
        # Return largest matches found.
        if nodes:
            return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)

For memoization, dictionaries are best, but they cannot be sliced, so they have to be changed to lists as indicated in the comments above.

kzh
Thanks for your help! As noted above, the code was written for sequences [that support slicing] (list, tuple, string, bytes, bytearray, and any custom-made containers). From your change, it appears that immutable sequences are needed for correct memoization.
Noctis Skytower