ansaurus

Question

How to correct bugs in this Damerau-Levenshtein implementation?

Answer 1

Do some elementary debugging. You know that it is going wrong in the 2nd output line marked #ED B. The wrong values seem to indicate that it finds one edit early on and never finds any more. This is possibly because one of the min() args is somehow clamped at 1. Print deletion_cost, substitution_cost, addition_cost ... which is wrong? Why is it wrong? Print the input text values. Temporarily disable the transposition section to see if that makes the problem go away. Check and re-check the _warp caper (a tricksy hobbit gimmick if I ever saw one) and the usage thereof. What happens if you compare "aaaaa" with "aaaaa"? "qwerty" with "qwerty"? "xxxxx" with "yyyyy"? Does the problem happen with all of bytes, bytearray and str input?
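The comparisons suggested above can be scripted as a small harness. This is only a sketch in plain Python 3, not the Cython code under discussion: `osa_distance` is a hypothetical, deliberately naive full-matrix implementation that stands in for the function being debugged, and `run_sanity_checks` and `CASES` are illustrative names.

```python
def osa_distance(a, b):
    """Naive full-matrix distance with adjacent transpositions (hypothetical stand-in)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything from a
    for j in range(cols):
        d[0][j] = j  # insert everything from b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]  # bool, counts as 0 or 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


# The inputs named in the answer, with the distances they must produce.
CASES = [("aaaaa", "aaaaa", 0), ("qwerty", "qwerty", 0), ("xxxxx", "yyyyy", 5)]


def run_sanity_checks(distance):
    """Run every case as str, bytes and bytearray; return a list of failures."""
    converters = (str, str.encode, lambda s: bytearray(s.encode()))
    failures = []
    for s1, s2, expected in CASES:
        for convert in converters:
            got = distance(convert(s1), convert(s2))
            if got != expected:
                failures.append((convert(s1), convert(s2), expected, got))
    return failures
```

Passing the function under test to `run_sanity_checks` returns an empty list when every case agrees; any entry in the list shows which input type and which pair went wrong.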

The free problem: I'd suspect corruption, not dizziness. Print the three arrays; are their contents as expected? Try enabling the free() one array at a time -- all broken? only one? which one?

Some asides on memory management: You may like to read this and consider using the Python-specific routines instead of malloc/free. Downsizing your array if there have been surrogates seems over the top.

Update: Followed my own suggestions. Deletion cost was stuffed. "oneago" was same as "thisrow". Problem causing both the wrong answer and the doubled (-! not corrupted !-) free: circular shuffle of pointers wasn't circular.

# twoago, oneago = oneago, thisrow ### BUG ###
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###
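Why the two-name version breaks can be seen with ordinary Python lists standing in for the malloc'ed rows. This is a minimal sketch; `shuffle_buggy` and `shuffle_fixed` are hypothetical helper names used only for illustration.

```python
def shuffle_buggy(twoago, oneago, thisrow):
    # Only two names are rebound. thisrow is untouched, so it still names
    # the buffer that oneago now names, and the old twoago buffer is lost.
    twoago, oneago = oneago, thisrow
    return twoago, oneago, thisrow


def shuffle_fixed(twoago, oneago, thisrow):
    # True rotation: all three buffers stay reachable under distinct names.
    twoago, oneago, thisrow = oneago, thisrow, twoago
    return twoago, oneago, thisrow


rows = ([0], [1], [2])  # stand-ins for the three row buffers
buggy = shuffle_buggy(*rows)
fixed = shuffle_fixed(*rows)

print(buggy[1] is buggy[2])           # True: oneago and thisrow are one buffer
print(len({id(r) for r in fixed}))    # 3: still three distinct buffers
```

With the buggy shuffle, writes to thisrow overwrite oneago during the inner loop, which matches the stuffed deletion cost, and in C, freeing all three pointers at the end frees one buffer twice while another is never freed.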

Update 2: [comment capacity too small] No mojo, just plain ordinary debugging spadework, as I suggested. "concentrating on this for my fix" is not "super-readable". The reference code does create a new list for each pass, which it CAN do because thisrow refers to nothing carried over from the previous pass. It doesn't NEED to do this, and in fact the initialisation, apart from the first and last elements, could consist of random numbers; those values are only there to fill out the list so that it can be indexed into instead of appended to, as some non-tricksy implementations do. So you can slavishly emulate the "reference implementation", at the cost of doing an extra (wasted) malloc/free, or you could ignore the Python-specific implementation details and use the reference implementation solely as a source of presumably correct answers. Then you could accept my fix, and later go on to saving time by chopping out most of the initialisation of the thisrow array.

Update 3: Here's a replacement reference implementation for you. It allocates 3 rows initially, in order to avoid the overhead of list creation inside the outer loop. It also avoids the unnecessary initialisation of all but the last element of thisrow. This eases the translation into C/Cython.

def damlevref2(seq1, seq2):
    # For Python 2.x as was the original.
    # Appears to work on Python 1.5.2 as well :-)
    seq2len = len(seq2)
    twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
    oneago = [-666] * (seq2len + 1) # ditto
    thisrow = range(1, seq2len + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
        thisrow[-1] = x + 1
        for y in xrange(seq2len):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[seq2len - 1]
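For readers on Python 3, the function above might be ported roughly as follows. This is a sketch, not part of the original answer: the name `damlevref2_py3` is made up here, and the only intended changes are `range` in place of `xrange` and an explicit `list()` around the initial `range`.

```python
def damlevref2_py3(seq1, seq2):
    # Python 3 port (assumed equivalent) of damlevref2 above.
    seq2len = len(seq2)
    twoago = [-777] * (seq2len + 1)  # pseudo-malloc; contents are never read
    oneago = [-666] * (seq2len + 1)  # ditto
    # Row for the empty prefix of seq1; the extra last slot is reached via index -1.
    thisrow = list(range(1, seq2len + 1)) + [0]
    for x in range(len(seq1)):
        # Circular three-way shuffle, so no two names share a buffer.
        twoago, oneago, thisrow = oneago, thisrow, twoago
        thisrow[-1] = x + 1
        for y in range(seq2len):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[seq2len - 1]


print(damlevref2_py3("ab", "ba"))        # 1: one transposition
print(damlevref2_py3("xxxxx", "yyyyy"))  # 5: five substitutions
```

Because indexing `bytes` and `bytearray` yields integers in Python 3, the same function should also accept those types without change.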

John Machin 2010-08-07 23:55:21

"tricksy hobbit gimmick"---the _warp() function just made it easier to keep the original idea of the code to access the last and next to last elements with indices -1 and -2. that way, i could keep code juggling at a minimum. i also felt really dumb watching myself writing two functions to find the minimum of two and three unsigned integers respectively. OTOH i am new to C / Cython, just wanted it to work, and the LOC count / intellectual overhead is negligible. testing goes on tomorrow, and thx for the many useful tips.

flow 2010-08-08 00:27:17

couldn't wait to test: i switched from malloc to PyMem_Malloc, then disabled the swapping---it's the swapping that caused the glibc problem after all. i seemingly solved the issue by swapping aliases, and calling PyMem_Free on the original, unswapped pointers.

flow 2010-08-08 00:49:56

"seemingly" etc: This is equivalent to "I waved a dead chicken at the volcano and it stopped erupting". Swapping 2 pointers instead of circular-shifting 3 pointers was the problem. Which malloc/free team is used is irrelevant to the problem. N.B. I detected the problem before reading your above comment, which in any case is only semi-understandable even now :-)

John Machin 2010-08-08 01:19:03

i find my own comments are written in a super-readable style, but maybe that's just me. i have to say "seemingly" here because the problem did not occur in a 100% reliable manner. thanks, you detected the problem faster than me just by using your mojo. it's 4 o'clock in the morning here and i have another strong suspect: whereas my version does swap three array pointers, the reference code creates a new list on each pass: `twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ]`. i am concentrating on this for my fix. and yes, `PyMem_Malloc` did not visibly change anything.

flow 2010-08-08 02:10:23
