In Python, is there a better way to parameterise strings into regular expressions than doing it manually like this:

import re

test = 'flobalob'
names = ['a', 'b', 'c']
for name in names:
    regexp = "%s" % name
    print regexp, re.search(regexp, test)

This noddy example tries to match each name in turn. I know there are better ways of doing that, but it's a simple example purely to illustrate the point.


The answer appears to be no: there's no real alternative. The best way to parameterise regular expressions in Python is as above, or with derivatives such as str.format(). I tried to write a generic question rather than 'fix ma codez, kthxbye'. For those still interested, I've fleshed out an example closer to my needs here:

import os, re

filenames = ['bob.txt', 'fred.txt', 'paul.txt']
for diskfilename in os.listdir('.'):
    for filename in filenames:
        name, ext = filename.split('.')
        regexp = r"%s.*\.%s" % (name, ext)
        m = re.search(regexp, diskfilename)
        if m:
            print diskfilename, regexp, m
            # ...

I'm trying to figure out the 'type' of a file based on its filename, of the form <filename>_<date>.<extension>. In my real code, the filenames array is a dict, containing a function to call once a match is found.
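A minimal sketch of that dict-of-callbacks dispatch (the handler functions, their return values, and the YYYYMMDD date format are assumptions for illustration, not from the original code):

```python
import re

# Hypothetical handlers -- the real code would do something useful here.
def handle_bob(match):
    return 'bob file dated %s' % match.group(1)

def handle_fred(match):
    return 'fred file dated %s' % match.group(1)

# Map each base filename to the function to call on a match,
# for disk names of the form <filename>_<date>.<extension>.
handlers = {'bob.txt': handle_bob, 'fred.txt': handle_fred}

def dispatch(diskfilename):
    for filename, handler in handlers.items():
        name, ext = filename.rsplit('.', 1)
        # re.escape keeps literal dots in name/ext from acting as wildcards.
        regexp = r'%s_(\d+)\.%s' % (re.escape(name), re.escape(ext))
        m = re.match(regexp, diskfilename)
        if m:
            return handler(m)
    return None
```

With this, `dispatch('bob_20090216.txt')` calls `handle_bob` and returns `'bob file dated 20090216'`, while an unknown name returns `None`.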

Other ways I've considered doing it:

  • Have a regular expression in the array. I already have an array of filenames without any regular-expression magic, so I am loath to do this. I have done this elsewhere in my code and it's a mess (though necessary there).

  • Match only on the start of the filename. This would work, but would break with .bak copies of files, etc. At some point I'll probably want to extract the date from the filename so would need to use a regular expression anyway.
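For the date-extraction part, a capture group in the same pattern does the job (a sketch assuming the date is eight digits, YYYYMMDD):

```python
import re

# Assumed form: <filename>_<date>.<extension>, date as YYYYMMDD.
m = re.match(r'bob_(\d{8})\.txt', 'bob_20090216.txt')
date = m.group(1) if m else None
```

The group gives you the date string without a second pass over the filename.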


Thanks for the responses suggesting alternatives to regular expressions to achieve the same end result. I was more interested in parameterising regular expressions, for now and for the future. I'd never come across fnmatch before, so it's all useful in the long run.
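Incidentally, fnmatch can also bridge back to regular expressions: fnmatch.translate() turns a glob pattern into a regex source string, which is itself a form of parameterisation (the exact translated string varies between Python versions, so only its matching behaviour is shown):

```python
import fnmatch
import re

# Glob pattern -> regular-expression source string.
pattern = fnmatch.translate('bob*.txt')
regexp = re.compile(pattern)

matched = regexp.match('bob_20090216.txt') is not None   # True
missed = regexp.match('fred.txt') is not None            # False
```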

+4  A: 

Well, as you build a regexp from a string, I see no other way. But you could parameterise the string itself with a dictionary:

d = {'bar': 'a', 'foo': 'b'}
regexp = '%(foo)s|%(bar)s' % d

Or, depending on the problem, you could use list comprehensions:

vlist = ['a', 'b', 'c']
regexp = '|'.join(vlist)

EDIT: Mat has clarified his question; that makes things different, and the above is no longer relevant.

I'd probably go with an approach like this:

filename = 'bob_20090216.txt'

regexps = {'bob': r'bob_[0-9]+\.txt',
           'fred': r'fred_[0-9]+\.txt',
           'paul': r'paul_[0-9]+\.txt'}

for filetype, regexp in regexps.items():
    m = re.match(regexp, filename)
    if m != None:
        print '%s is of type %s' % (filename, filetype)
paprika
+1 I checked the documentation to make sure, there's no way to do it (other than parametrizing the string as you say). And I don't think Python needs one.
David Zaslavsky
@paprika: I've clarified the example to explain a little better what I'm getting at. @David: Couldn't find anything in the docs myself, but assumed it would be common enough for there to be something - perhaps that something is using strings in this manner.
Mat
`if m:` is sufficient in this case. In general `if obj is not None` is better than `if obj != None`.
J.F. Sebastian
@J.F. Sebastian:Indeed, 'if m: ...' would be enough. I somehow stuck with this since I learned to avoid using the brief 'if v: ...' to check for boolean truth/falseness (which is a whole different story). Could you elaborate on why 'is not' is better? Just because of readability or anything else?
paprika
`is` checks for object identity (object address in memory), therefore it is highly efficient, but I use this form purely for readability.
J.F. Sebastian
A quick timeit test shows no performance gain from using [0-9]+ instead of \d+ -- is there another reason not to use the shorter form?
akaihola
+2  A: 
import fnmatch, os

filenames = ['bob.txt', 'fred.txt', 'paul.txt']

                  # 'b.txt.b' -> 'b.txt*.b'
filepatterns = ((f, '*'.join(os.path.splitext(f))) for f in filenames) 
diskfilenames = filter(os.path.isfile, os.listdir('.'))
pattern2filenames = dict((fn, fnmatch.filter(diskfilenames, pat))
                         for fn, pat in filepatterns)

print pattern2filenames

Output:

{'bob.txt': ['bob20090217.txt'], 'paul.txt': [], 'fred.txt': []}

Answers to previous revisions of your question follow:


I don't understand your updated question, but filename.startswith(prefix) might be sufficient in your specific case.

After you've updated your question the old answer below is less relevant.


  1. Use re.escape(name) if you'd like to match a name literally.

  2. Any tool available for string parametrization is applicable here. For example:

    import string
    print string.Template("$a $b").substitute(a=1, b="B")
    # 1 B
    

    Or using str.format() in Python 2.6+:

    print "{0.imag}".format(1j+2)
    # 1.0
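For item 1, a quick sketch of why re.escape() matters when the name being interpolated contains regex metacharacters such as a dot:

```python
import re

name = 'bob.txt'
regexp = re.escape(name)   # the '.' becomes a literal '\.' in the pattern

# Without escaping, 'bob.txt' as a pattern would also match 'bobxtxt',
# because the unescaped dot matches any character.
literal_only = re.match(regexp, 'bob.txt') is not None      # True
false_positive = re.match(regexp, 'bobxtxt') is not None    # False
```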
    
J.F. Sebastian
+1  A: 

Maybe the glob and fnmatch modules can be of some help to you?

SilentGhost