views:

114

answers:

7

I am reading a list of strings, each of which relates to a file name. However, each string lacks the extension. I have come up with the following code:

import re
item_list = ['item1', 'item2']
search_list = ['item1.exe', 'item2.pdf']
matches = []
for item in item_list:
    # Match item in search_list using re - I assume this is the best way to do this
    regex = re.compile("^" + item + r"\.")
    for file in search_list:
        if regex.match(file):
            matches.append((item, file))

As for duplicate matches, I'm not intensely worried about two files being named 'foo.bar' and 'foo.foo.bar'. That being said, is there a better way of doing this?

Thank you.

A: 

Here's another way to do it that is likely faster than Alex's original code:

item_list = ['item1', 'item2']
search_list = ['item1.exe', 'item2.pdf']
matches = []
for item in item_list:
    for filename in search_list:
        if filename.partition(".")[0] == item:
            matches.append((item,filename))
Series8217
+2  A: 

You could combine all the items into one regexp like this, which will be more efficient:

import re
item_list = ['item1', 'item2']
regex = re.compile("^(" + "|".join(item_list) + r")\.")
search_list = ['item1.exe', 'item2.pdf']
matches = []
for file in search_list:
    match = regex.match(file)
    if match:
        matches.append((match.group(1), file))

A better solution, though, might be to use the os.path functions to split the base names off the filenames and look them up in a set.
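
A minimal sketch of that idea, assuming the entries in search_list are plain file names or paths. The names item_set and stem are only illustrative, and the extra 'item3.txt' entry is made up to show a non-match:

```python
import os.path

item_list = ['item1', 'item2']
search_list = ['item1.exe', 'item2.pdf', 'item3.txt']

# A set gives constant-time membership tests on average.
item_set = set(item_list)

matches = []
for filename in search_list:
    # basename drops any directory part; splitext removes only the
    # last extension, e.g. 'item1.exe' -> 'item1'.
    stem = os.path.splitext(os.path.basename(filename))[0]
    if stem in item_set:
        matches.append((stem, filename))

print(matches)  # [('item1', 'item1.exe'), ('item2', 'item2.pdf')]
```

Note that the matches come out in search_list order rather than item_list order.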

Nick Craig-Wood
If the items can contain regex-special punctuation like `.`, you'll need to `re.escape` each item in `item_list` before joining.
bobince
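To illustrate the point about escaping, here is a small sketch. The item name 'v1.0' and the file names are made up for the example:

```python
import re

item_list = ['v1.0', 'item2']
search_list = ['v1.0.zip', 'v1x0.zip', 'item2.pdf']

# Without escaping, the '.' inside 'v1.0' matches any character.
unescaped = re.compile("^(" + "|".join(item_list) + r")\.")

# With re.escape, every item is matched literally.
escaped = re.compile(
    "^(" + "|".join(re.escape(item) for item in item_list) + r")\."
)

loose = [f for f in search_list if unescaped.match(f)]
strict = [f for f in search_list if escaped.match(f)]

print(loose)   # ['v1.0.zip', 'v1x0.zip', 'item2.pdf']
print(strict)  # ['v1.0.zip', 'item2.pdf']
```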
Thanks Nick, this post deserves a hundred useful votes! Found the timeit module and ran tests based on my original algorithm, Dave Kirby's algorithm, and yours. The results were as follows: alex_k: 15.93, dave_kirby: 6.98, nick_craig_wood: 0.24
Alex
+2  A: 

Use splitext to get the filename without the extension:

import os.path

for item in item_list:
    for filename in search_list:
        if item == os.path.splitext(filename)[0]:
            matches.append((item, filename))

This is not only more correct, it also makes your intention easier to understand from reading the code. Alternatively, if you want foo to match foo.bar.txt, use filename.startswith(item + '.') instead.
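
The difference between the two checks can be sketched as follows. The file names here are invented for the example:

```python
import os.path

item = 'foo'
filenames = ['foo.txt', 'foo.bar.txt', 'foobar.txt']

# Exact match on the name with only its last extension removed.
exact = [f for f in filenames if os.path.splitext(f)[0] == item]

# Prefix match: anything that begins with 'foo.' counts.
prefix = [f for f in filenames if f.startswith(item + '.')]

print(exact)   # ['foo.txt']
print(prefix)  # ['foo.txt', 'foo.bar.txt']
```

Neither check matches 'foobar.txt', because both require the dot to follow the item name directly.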

Mark Byers
+1 for splitext. Accurately does what it says; more readable than regex.
bobince
A: 

I think you should use .rsplit(".",1) for that purpose. Isn't a regex overkill?

>>> item_list = ['item1', 'item2','item3']
>>> search_list = ['item1.exe', 'item2.pdf','item9999.txt']
>>>
>>> [(x.rsplit(".",1)[0],x) for x in search_list if x.rsplit(".",1)[0] in item_list]
[('item1', 'item1.exe'), ('item2', 'item2.pdf')]

Or with a for loop:

matches=[]
for x in search_list:
    y=x.rsplit(".",1)[0]
    if y in item_list:
        matches.append((y,x))
S.Mark
+1  A: 

You do not need to use a regex for this since you are doing exact string matches (no wildcards, groups etc) - you can use str.startswith(..) instead. This is equivalent to your code:

for item in item_list:
    match = item + "."
    for file in search_list:
        if file.startswith(match):
            matches.append((item, file))

However, Nick Craig-Wood's suggestion of compiling all the matches into a single regex may be more efficient. I suggest you benchmark both if speed is an issue.
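
One way to run such a benchmark is the standard library's timeit module. The following is only a sketch: the function names, the generated test data, and the repeat count are assumptions for illustration, and the timings will vary with the machine and the size of the lists.

```python
import re
import timeit

# Made-up test data: 50 items, 50 file names, 25 of which match.
item_list = ['item%d' % i for i in range(50)]
search_list = ['item%d.exe' % i for i in range(0, 100, 2)]

def with_startswith():
    matches = []
    for item in item_list:
        prefix = item + "."
        for filename in search_list:
            if filename.startswith(prefix):
                matches.append((item, filename))
    return matches

# Compiled once, outside the timed function.
combined = re.compile(
    "^(" + "|".join(re.escape(item) for item in item_list) + r")\."
)

def with_single_regex():
    matches = []
    for filename in search_list:
        m = combined.match(filename)
        if m:
            matches.append((m.group(1), filename))
    return matches

for func in (with_startswith, with_single_regex):
    # timeit.timeit accepts a callable and returns total seconds.
    seconds = timeit.timeit(func, number=1000)
    print("%s: %.3f s" % (func.__name__, seconds))
```

Both functions find the same pairs, but in a different order: the first follows item_list, the second follows search_list.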

Dave Kirby
Any tools/commands to help benchmark would be a +1!
Alex
A: 
>>> for file in search_list:
...  tomatch=file.split(".")[0]
...  if tomatch in item_list:
...     found=item_list.index(tomatch)
...     matches.append( ( file, item_list[found] ) )
...
>>> print(matches)
[('item1.exe', 'item1'), ('item2.pdf', 'item2')]
>>>

No need for regex.

+1  A: 

Avoid re unless you really need it. For simple string matching, you don't really need it.

Mark Byers's answer duplicates the original behaviour of keeping matches in item_list-order. If you don't need that, you could do it even more simply/quickly:

for file in search_list:
    item= os.path.splitext(file)[0]
    if item in item_list:
        matches.append((item, file))

If you don't need to keep the (item) matched either (since it's redundant from the filename anyway), you've got a one-liner:

matches= [file for file in search_list if os.path.splitext(file)[0] in item_list]
bobince
They do need to be matched, but thanks for giving a great example of a one-liner!
Alex