ansaurus

Question

elegant way to match two wildcarded strings

Answer 1

+1 A:

Well, as long as one ? corresponds to one character, then I can suggest a performant and a compact enough method.

def match(str1, str2):
    if len(str1) != len(str2): return False
    for index, ch1 in enumerate(str1):
        ch2 = str2[index]
        if ch1 == '?' or ch2 == '?': continue
        if ch1 != ch2: return False
    return True

>>> ================================ RESTART ================================
>>> 
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
>>> 
>>> match("R??lcopter", "Roflcop?er")
True
>>>

Edit: Part B), brain-fart free now.

def sets_match(set1, set2):
    return any(match(str1, str2) for str1 in set1 for str2 in set2)

>>> ================================ RESTART ================================
>>> 
>>> s1 = set(['a?', 'fg'])
>>> s2 = set(['?x'])
>>> sets_match(s1, s2) # a? = x?
True
>>>

Hamish Grubijan 2010-10-14 19:06:06

? is often one character, but it can be 2 or 3 characters as well

Claudiu 2010-10-14 19:53:52

@Claudiu, in that case I would A) first expand the string into a set of possible strings (the possibilities are limited), and B) then intersect two sets of candidates and check for emptyness (`isdisjoint`). Part B) is trivial, I can add code for it. For some reason I am stuck on part A). But, JoshD has suggested a method which will not trust OCR 100%, which might be necessary.

Hamish Grubijan 2010-10-14 20:02:50

Hmm part A is not that feasible. Let's say only lowercase letters, then ? would yield (26 + 26^2 + 26^3) possibilities, two ? would yield that ^2, which is already 334 million.

Claudiu 2010-10-14 20:22:41

Claudiu, no, you only need to replace '?' with '?', '??', and maybe '???' - depending on how many you want. This should not be computationally hard, but my skills at tail-recursion and other clever techniques suck at the moment, hopefully temporary. However, part B) was described incorrectly, let me add an `O(n1*n2)` in terms of number of strings implementation of `set_match`.

Hamish Grubijan 2010-10-14 20:26:30

oh, got it! it works too, ill put in my version but accept yours. nice list comprehension syntax btw, didnt know about that!

Claudiu 2010-10-14 20:35:38

I'm concerned that even accounting for multiple ?, this may miss false positives. For example: "RoAcopter". In some fonts fl could be upper-case A. Is that a consideration here, or do we always know that we'll have proper `?` in place?

JoshD 2010-10-14 20:39:17

@JoshD: generally the only letters it'll confuse in the fonts i'm using are l and I.

Claudiu 2010-10-14 20:44:49

@JoshD, this answer came with a disclamer - it assumes that OCR positively identifies letters only when it is absolutely sure. A perfect answer would probably contain parts of an improved version of yours and mine combined. That will reduce the risk greatly, but still might have "worst-case" holes in it. But, humans make mistakes as well when reading text :)

Hamish Grubijan 2010-10-14 20:46:32

@Claudiu, Yes, `any` is neat. In fact, the body of the `sets_match` function should be inlined. I have not written any serious Python code, I would not be surprised if there is cleaner syntax. I would wait a good week before accepting an answer just in case.

Hamish Grubijan 2010-10-14 20:49:31

@Hamish: too true.. sometimes it's read something, and I'll think it's wrong, but when I look closer it was correct

Claudiu 2010-10-14 20:49:40

Answer 2

+1 A:

Using the Levenshtein distance may be useful. It will give a value of how similar the strings are to each other. This will work if they are different lengths, too. The linked page has some psuedocode to get you started.

You'll end up with something like this:

>>> match("Roflcopter", "Roflcop?er")
1
>>> match("R??lcopter", "Roflcopter")
2
>>> match("R?lcopter", "Roflcop?er")
3

So you could have a maximum threshold below which you say they may match.

JoshD 2010-10-14 19:54:56

This is a damn good way to go if the OCR did not put a ? where it should have. However, this method will also tell you that strings 'a' and 'w' are a short distance apart, but there is little chance that OCR could not tell the two apart. Some smarter metric than plain threshold is needed.

Hamish Grubijan 2010-10-14 19:59:48

@Hamish Grubijan: I agree. Perhaps a normalized difference would be better. Something like `Levenshtein/length`. Of course, this probably has issues, too. Like you said, some consideration of typical OCR results would be useful to incorporate.

JoshD 2010-10-14 20:12:23

Answer 3

+2 A:

Thanks to Hamish Grubijan for this idea. Every ? in my ocr'd names can be anywhere from 0 to 3 letters. What I do is expand each string to a list of possible expansions:

>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']

then I expand both and use his matching function, which I called matchats:

def matchOCR(l, r):
    for expl in expQuestions(l):
        for expr in expQuestions(r):
            if matchats(expl, expr):
                return True
    return False

Works as desired:

>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True

The matching function:

def matchats(l, r):
    """Match two strings with @ representing exactly 1 char"""
    if len(l) != len(r): return False
    for i, c1 in enumerate(l):
        c2 = r[i]
        if c1 == "@" or c2 == "@": continue
        if c1 != c2: return False
    return True

and the expanding function, where cartesian_product does just that:

def expQuestions(s):
    """For OCR w/ a questionmark in them, expand questions with
    @s for all possibilities"""
    numqs = s.count("?")

    blah = list(s)
    for expqs in cartesian_product([(0,1,2,3)]*numqs):
        newblah = blah[:]
        qi = 0
        for i,c in enumerate(newblah):
            if newblah[i] == '?':
                newblah[i] = '@'*expqs[qi]
                qi += 1
        yield "".join(newblah)

Claudiu 2010-10-14 20:48:13

Cool, at a quick glance - my only issue is with variable names :) Oh, you also need to worry about ? and @ being valid part of the text and not a special symbol. Instead of @ I would chose something that is non-printed instead. Latest Python is Unicode, so you have enough room for weird characters in there. `>>> weird_ch = chr(255)>>> weird_ch'\xff'`

Hamish Grubijan 2010-10-14 21:08:48

yea i picked those values cause ? and @ dont normally appear in the strings. and yea i didnt wanna know what to call those variables and i was lazy... ill have fun debugging it in 2 months probably. at least `newblah` is descriptive in that it is a new `blah`... kind of..

Claudiu 2010-10-14 21:26:57

Gawd, I work with a guy who is just like you. Boy do I hate it when his bug makes it to my plate. :) Funny thing is that he can read his code months after he wrote it. having excellent memory can be a mixed blessing.

Hamish Grubijan 2010-10-14 21:40:17

Answer 4

+1 A:

This might not be the most Pythonic of options, but if a ? is allowed to match any number of characters, then the following backtracking search does the trick:

def match(a,b):
    def matcher(i,j):
        if i == len(a) and j == len(b):
            return True
        elif i < len(a) and a[i] == '?' \
          or j < len(b) and b[j] == '?':
            return i < len(a) and matcher(i+1,j) \
                or j < len(b) and matcher(i,j+1)
        elif i == len(a) or j == len(b):
            return False
        else:
            return a[i] == b[j] and matcher(i+1,j+1)

    return matcher(0,0)

This may be adapted to be more stringent in what to match. Also, to save stack space, the final case (i+1,j+1) may be transformed into a non-recursive solution.

Edit: some more clarification in response to the reactions below. This is an adaptation of a naive matching algorithm for simplified regexes/NFAs (see Kernighan's contrib to Beautiful Code, O'Reilly 2007 or Jurafsky & Martin, Speech and Language Processing, Prentice Hall 2009).

How it works: the matcher function recursively walks through both strings/patterns, starting at (0,0). It succeeds when it reaches the end of both strings (len(a),len(b)); it fails when it encounters two unequal characters or the end of one string while there are still characters to match in the other string.

When matcher encounters a variable (?) in either string (say a), it can do two things: either skip over the variable (matching zero characters), or skip over the next character in b but keep pointing to the variable in a, allowing it to match more characters.

larsmans 2010-10-14 20:50:02

hmm I like it. I briefly reminisced about DFAs when thinking about this, and yours pretty much emulates one. i'd just have to adjust it to match at most 3 wild chars..

Claudiu 2010-10-14 21:52:10

@larsmans This looks intriguing, but could you please break this down a bit - what is going on here?

Hamish Grubijan 2010-10-14 22:08:53

I added an explanation and a few references.

larsmans 2010-10-15 08:41:25

ansaurus

tags:

views:

answers:

elegant way to match two wildcarded strings

related questions