Hello,

What is the correct regex to use with re.search() to find and return a file extension in a string?

Such as: (.+).(avi|rar|zip|txt)

I need it to search a string and, if it contains any of those (avi, rar, etc.), return just that extension.

Thanks!

EDIT: I should add that it needs to be case insensitive.

+6  A: 

You need:

(.)\.(avi|rar|zip|txt)$

Note the backslash to escape the dot. This will make it look for a literal dot rather than any character.

To make it case insensitive, use the re.I flag in your search call.

re.search(r'(.)\.(avi|rar|zip|txt)$', string, re.I)
JoshD
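As a minimal sketch of how the pattern in this answer might be wrapped in a helper: the name `find_extension` is made up for illustration and is not from the answer. Group 2 holds the extension, and `re.I` makes the match case insensitive.

```python
import re

# Pattern from the answer above: one character, a literal dot, then one of
# the known extensions at the end of the string.
EXT_RE = re.compile(r'(.)\.(avi|rar|zip|txt)$', re.I)


def find_extension(name):
    """Return the matched extension as written in `name`, or None."""
    match = EXT_RE.search(name)
    return match.group(2) if match else None


print(find_extension("holiday.AVI"))   # AVI
print(find_extension("notes.doc"))     # None
```

The extension comes back in whatever case the input used; call `.lower()` on the result if a normalized form is wanted.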
So is there also a flag that makes the Python interpreter case-insensitive? Otherwise we have to `import re as RE` to be able to find `RE.I`...
Nick T
You can make it slightly more efficient, and more precisely match what's being looked for, by changing it to `.\.(avi|rar|zip|txt)$`: this will ensure that there's some character before the dot, and that the file extension is at the end of the string. This way the extension ends up in the first group rather than the second one, and you don't keep a group that you don't need.
intuited
@Nick T: the re.I flag is just for the regular expressions module. I'm not aware of a way to make the rest of python case-insensitive.
JoshD
@JoshD: I was making some (fail) joke at you messing up case with a flag that sets case-insensitivity. (`RE.I` instead of `re.I`)
Nick T
@Nick T: Well, nuts. Now I look quite the fool. I fixed the answer, though.
JoshD
@intuited: aarrgghh ... use `\Z` not `$`
John Machin
@JoshD: -1. The pattern needs to guard against matching too early ... "a.rare.avian.creature.txt" is a valid filename. You need `\b` or `\Z` at the end of the pattern, depending on what the OP really wants.
John Machin
@JoshD: also if using re.search, having `blah+` at the start of the pattern when merely `blah` is adequate is at the very least redundant and may cause quadratic behaviour when failing.
John Machin
@John Machin: Thanks for the pointers. I've altered the answer to quell the misinformation. But what's the justification for \Z vs $? What if there's a multiline list of filenames?
JoshD
@JoshD: The justification for `blah\Z` in the default non-multiline mode is that `re.match("blah$", "blah\n")` will not return `None`. Multiline list: if you want to muck around emulating some 1970s Unix text editor, feel free to use re.MULTILINE mode. If you want to process data, unpack it from whatever container it's in, getting rid of (non-data) delimiters like `\n`. Then validate your data.
John Machin
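The two points raised in these comments (matching too early without an end anchor, and `$` versus `\Z`) can be checked with a short sketch. The sample filenames other than "a.rare.avian.creature.txt" are made up for illustration.

```python
import re

# Without an end anchor, re.search stops at the first place the pattern fits.
unanchored = re.compile(r'.\.(avi|rar|zip|txt)', re.I)
print(unanchored.search("a.rare.avian.creature.txt").group(1))  # rar

# `$` also matches just before a trailing newline; `\Z` only at the very end.
dollar = re.compile(r'.\.(avi|rar|zip|txt)$', re.I)
strict = re.compile(r'.\.(avi|rar|zip|txt)\Z', re.I)

print(dollar.search("movie.avi\n") is not None)  # True
print(strict.search("movie.avi\n") is not None)  # False
print(strict.search("a.rare.avian.creature.txt").group(1))  # txt
```

Whether the trailing-newline match is a bug or a convenience depends on where the strings come from, which is exactly what the comments disagree about.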
+1  A: 

Short interactive run:

>>> import re
>>> pat="(.+)\.(avi|rar|zip|txt)"
>>> re.search(pat, "abcdefg.zip", re.IGNORECASE).groups()
('abcdefg', 'zip')
>>> re.search(pat, "abcdefg.ZIP", re.IGNORECASE).groups()
('abcdefg', 'ZIP')
>>> 
gimel
In this particular case it's a non-issue, but it is recommended that regex literals be raw strings, to avoid double escaping. Use `r"(.+)\.(avi|rar|zip|txt)"`
TokenMacGuy
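To illustrate the raw-string advice: `"\."` happens to survive in a normal string literal because `\.` is not a recognized string escape (newer Python versions warn about such escapes), but an escape like `\b` does not. The sample text "a zip file" below is an assumption for the sake of the example.

```python
import re

# "\b" in a normal string literal is a single backspace character,
# while r"\b" is a backslash followed by "b" (the regex word boundary).
print(len("\b"), len(r"\b"))  # 1 2

print(re.search("\bzip\b", "a zip file"))                # None
print(re.search(r"\bzip\b", "a zip file") is not None)   # True
```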
A: 
(.+)[.](avi|rar|zip|txt)

Then group 2 will be the extension.

I have just written a blog post about regular expressions, http://blogs.appframe.com/erikv/2010-09-23-Regular-Expression, if you want to read more about this.

sv88erik
A: 

Since I think regex is evil...

def return_extension(filename):
    '''(This function assumes that filenames such as `.foo` have extension
    `foo`.)
    '''
    tokens = filename.split('.')

    return '' if len(tokens) == 1 else tokens[-1]

...I advocate simply parsing the filename.

Beau Martínez
Reinventing the wheel but not reinventing the axle is even more evil.
John Machin
A: 

If you know that the extension is at the very end of the string, this should work well:

.\.(avi|rar|zip|txt)$
  • The first bit will ensure that there's some character before the dot.

  • The $ specifies that the file extension is at the end of the string, i.e. the $ means "the string ends here". For gory details on this, including some edge cases with newlines that you should be aware of, see the comment discussion for JoshD's answer, as well as the entry for $ in the docs.

So then the only entry in the match.groups() tuple, i.e. match.groups()[0], will be the extension itself.

intuited
@intuited: -1. s/some edge cases/FAIL/
John Machin
@John Machin: Crap, really? I can't think of any. What's an example?
intuited
@intuited: """The justification for blah\Z in the default non-multiline mode is that re.match("blah$", "blah\n") will not return None"""
John Machin
@John Machin: I think you need to re-read my answer, specifically the caveat that "you know that the extension is at the *very end* of the string". This is a pretty common use case (e.g. you've read in and done `split('\n')` on a file listing from a file or pipeline), so it seems worth giving a specific solution for it. In this case I think it's actually better to use the `$` because it's compatible with `fileinput.input()` without having to `rstrip` the lines first.
intuited
@intuited: I did read that first line, twice, and decided twice not to take issue with it. Third time unlucky: How can one KNOW that the extension is at the very end of the string? In any case, whether you think that you know or not, `\Z` does the job reliably. Another way of looking at it is that `$` is a perlish substitute for `\n?\Z` ... fileinput.input()? oh, yeah, I remember, Python crutch for awk tragics -- I stopped using it some time in 1998.
John Machin
+7  A: 

The standard library is better ;)

>>> import os
>>> os.path.splitext('hello.py')
('hello', '.py')
Ant
+1 this is the right tool for the job!
katrielalex
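For the original question (return the extension only if it is one of a known set, ignoring case), a sketch built on `os.path.splitext` might look like this. The names `ALLOWED` and `allowed_extension` are hypothetical, chosen for illustration.

```python
import os

# Hypothetical whitelist taken from the extensions named in the question.
ALLOWED = {"avi", "rar", "zip", "txt"}


def allowed_extension(filename):
    """Return the extension (without the dot) if it is in ALLOWED, else None."""
    ext = os.path.splitext(filename)[1]   # e.g. ".ZIP", or "" if there is none
    ext = ext[1:]                         # drop the leading dot
    return ext if ext.lower() in ALLOWED else None


print(allowed_extension("backup.ZIP"))                  # ZIP
print(allowed_extension("a.rare.avian.creature.txt"))   # txt
print(allowed_extension("hello.py"))                    # None
```

Note that `os.path.splitext` treats a leading dot as part of the name, so `.txt` on its own has no extension here.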