Hi,

What's the easiest way to convert the simpler regex-like format that most users are used to into a correct regex string for Python's re module?

As an example, I need to convert this:

string = "*abc+de?"

to this:

string = ".*abc.+de.?"

Of course I could loop through the string and build up another string character by character, but that's surely an inefficient way of doing this?

A: 

I'd use replace:

def wildcard_to_regex(pattern):
    # Chain one replace per wildcard; the raw string keeps the backslash in \d intact.
    return pattern.replace("*", ".*").replace("?", ".?").replace("#", r"\d")

This probably isn't the most efficient way, but it should be efficient enough for most purposes. Note that some wildcard formats allow character classes, which are more difficult to handle.

Konrad Rudolph
A: 

Here is a Perl example of doing this. It is simply using a table to replace each wildcard construct with the corresponding regular expression. I've done this myself previously, but in C. It shouldn't be too hard to port to Python.

unwind
+1  A: 

You'll probably only be doing this substitution occasionally, such as each time a user enters a new search string, so I wouldn't worry about how efficient the solution is.

You need to generate a list of the replacements you need to convert from the "user format" to a regex. For ease of maintenance I would store these in a dictionary, and like @Konrad Rudolph I would just use the replace method:

def wildcard_to_regex(wildcard):
    replacements = {
        '*': '.*',
        '?': '.?',
        '+': '.+',
        }
    regex = wildcard
    for (wildcard_pattern, regex_pattern) in replacements.items():
        regex = regex.replace(wildcard_pattern, regex_pattern)
    return regex

Note that this only works for simple character replacements, although other complex code can at least be hidden in the wildcard_to_regex function if necessary.
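If the chained replacements ever start to interfere with each other (for example, if a replacement string contained another wildcard character), one option is to do the whole conversion in a single pass with re.sub. This is only a sketch under the same assumptions as the question's example; the names REPLACEMENTS, WILDCARD and wildcard_to_regex_single_pass are made up for illustration.

```python
import re

# Hypothetical table; the mapping mirrors the question's example.
REPLACEMENTS = {'*': '.*', '?': '.?', '+': '.+'}
WILDCARD = re.compile(r'[*?+]')

def wildcard_to_regex_single_pass(wildcard):
    # Each wildcard character is looked up exactly once, so the output
    # of one replacement is never rescanned by another.
    return WILDCARD.sub(lambda m: REPLACEMENTS[m.group(0)], wildcard)

print(wildcard_to_regex_single_pass("*abc+de?"))  # .*abc.+de.?
```

Like the loop above, this does not escape other regex special characters in the input.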

(Also, I'm not sure that ? should translate to .? -- I think normal wildcards have ? as "exactly one character", so its replacement should be a simple . -- but I'm following your example.)
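To make the difference between the two translations of ? concrete, here is a small comparison. It assumes the question's example pattern and uses re.fullmatch so that the whole string has to match.

```python
import re

# '.?' accepts zero or one trailing character; '.' requires exactly one.
optional = re.compile(r'.*abc.+de.?')
exactly_one = re.compile(r'.*abc.+de.')

print(bool(optional.fullmatch('abcXde')))      # True
print(bool(exactly_one.fullmatch('abcXde')))   # False
print(bool(exactly_one.fullmatch('abcXdeY')))  # True
```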

Paul Stephenson
+2  A: 

Calling .replace() for each of the wildcards is the quick way, but what if the wildcarded string contains other regex special characters? E.g., someone searching for 'my.thing*' probably doesn't mean for that '.' to match any character. And in the worst case, things like match-group-creating parentheses are likely to break your final handling of the regex matches.

re.escape can be used to put literal characters into regexes. You'll have to split out the wildcard characters first, though. The usual trick for that is to use re.split with a capturing group (parentheses), resulting in a list of the form [literal, wildcard, literal, wildcard, literal...].
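To see the shape of that list, here is what re.split returns when the pattern is wrapped in parentheses. Note the empty strings that appear when a wildcard sits at the start or end of the input.

```python
import re

wildcards = re.compile('([?*+])')

# Even positions are literal text, odd positions are the wildcards.
print(wildcards.split('my.thing*'))  # ['my.thing', '*', '']
print(wildcards.split('*abc+de?'))   # ['', '*', 'abc', '+', 'de', '?', '']
```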

Example code:

import re

wildcards = re.compile('([?*+])')
escapewild = {'?': '.', '*': '.*', '+': '.+'}

def escapePart(indexed_part):
    # Python 3 no longer allows tuple parameters, so unpack explicitly.
    parti, part = indexed_part
    if parti % 2 == 0:  # even items are literals
        return re.escape(part)
    else:  # odd items are wildcards
        return escapewild[part]

def convertWildcardedToRegex(s):
    parts = map(escapePart, enumerate(wildcards.split(s)))
    return '^%s$' % (''.join(parts))
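
As a quick check of what the escaping buys you, here is a compact Python 3 restatement of the same split-and-escape idea, followed by two matches. The names WILDCARDS, TRANSLATION and wildcard_to_anchored_regex are hypothetical and used only for this sketch.

```python
import re

WILDCARDS = re.compile('([?*+])')
TRANSLATION = {'?': '.', '*': '.*', '+': '.+'}

def wildcard_to_anchored_regex(s):  # hypothetical name
    parts = WILDCARDS.split(s)
    # Even indexes hold literal text, odd indexes hold wildcard characters.
    return '^' + ''.join(
        re.escape(p) if i % 2 == 0 else TRANSLATION[p]
        for i, p in enumerate(parts)
    ) + '$'

pattern = wildcard_to_anchored_regex('my.thing*')
print(bool(re.match(pattern, 'my.thing.txt')))  # True
print(bool(re.match(pattern, 'myXthing.txt')))  # False: the '.' is literal
```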
bobince
+5  A: 

Those don't look like regexps you're trying to translate; they look more like Unix shell globs. Python already has a module for this, fnmatch. It doesn't know about the "+" syntax you used, but neither does my shell, and I think that syntax is nonstandard.

>>> import fnmatch
>>> fnmatch.fnmatch("fooabcdef", "*abcde?")
True
>>> help(fnmatch.fnmatch)
Help on function fnmatch in module fnmatch:

fnmatch(name, pat)
    Test whether FILENAME matches PATTERN.

    Patterns are Unix shell style:

    *       matches everything
    ?       matches any single character
    [seq]   matches any character in seq
    [!seq]  matches any char not in seq

    An initial period in FILENAME is not special.
    Both FILENAME and PATTERN are first case-normalized
    if the operating system requires it.
    If you don't want this, use fnmatchcase(FILENAME, PATTERN).

>>>
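
If you need the regex itself rather than a yes/no match, the same module also provides fnmatch.translate, which returns a regex string for a shell-style pattern. A minimal sketch follows; the exact text of the returned string differs between Python versions, so it is compiled and tested here rather than printed.

```python
import fnmatch
import re

# Convert the glob to a regex string, then compile it as usual.
regex = fnmatch.translate("*abcde?")
compiled = re.compile(regex)

print(bool(compiled.match("fooabcdef")))  # True
print(bool(compiled.match("fooabcde")))   # False: '?' needs one more character
```

As with fnmatch.fnmatch, the "+" wildcard from the question is not part of this syntax and would be treated as a literal character.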
fivebells