ansaurus

Question

python regex question

Answer 1

+6 A:

You need:

(.)\.(avi|rar|zip|txt)$

Note the backslash to escape the dot. This will make it look for a literal dot rather than any character.

To make it case-insensitive, pass the re.I flag in your search call.

re.search(r'(.)\.(avi|rar|zip|txt)$', string, re.I)

JoshD 2010-10-11 17:56:13

So is there also a flag that makes the Python interpreter case-insensitive? Otherwise we have to `import re as RE` to be able to find `RE.I`...

Nick T 2010-10-11 18:41:16

You can make it slightly more efficient, and match more precisely what's being looked for, by changing it to `.\.(avi|rar|zip|txt)$`: this ensures that there's some character before the dot, and that the file extension is at the end of the string. This way the extension is the first group rather than the second, and you don't capture a group that you don't need.

intuited 2010-10-11 18:48:44

@Nick T: the re.I flag is just for the regular expressions module. I'm not aware of a way to make the rest of python case-insensitive.

JoshD 2010-10-11 19:04:31

@JoshD: I was making some (fail) joke at you messing up case with a flag that sets case-insensitivity. (`RE.I` instead of `re.I`)

Nick T 2010-10-11 19:12:04

@Nick T: Well, nuts. Now I look quite the fool. I fixed the answer, though.

JoshD 2010-10-11 19:23:48

@intuited: aarrgghh ... use `\Z` not `$`

John Machin 2010-10-11 19:25:46

@JoshD: -1. The pattern needs to guard against matching too early ... "a.rare.avian.creature.txt" is a valid filename. You need `\b` or `\Z` at the end of the pattern, depending on what the OP really wants.

John Machin 2010-10-11 19:37:01
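
A small sketch of the early-match concern raised in the comment above. The variable names are made up for illustration, and the second test string is just the same filename cut short; this is one way to see the problem, not the only fix.

```python
import re

# Pattern without an end anchor: re.search accepts the first place it fits.
unanchored = re.compile(r'(.)\.(avi|rar|zip|txt)', re.I)
# Same pattern, anchored to the true end of the string with \Z.
anchored = re.compile(r'(.)\.(avi|rar|zip|txt)\Z', re.I)

name = "a.rare.avian.creature.txt"

early = unanchored.search(name)
print(early.group(0))   # a.rar  (matched inside ".rare", not the real extension)

late = anchored.search(name)
print(late.group(2))    # txt

# A name with no listed extension at the end is still accepted without the anchor.
print(unanchored.search("a.rare.avian") is not None)  # True
print(anchored.search("a.rare.avian") is None)        # True
```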

@JoshD: also if using re.search, having `blah+` at the start of the pattern when merely `blah` is adequate is at the very least redundant and may cause quadratic behaviour when failing.

John Machin 2010-10-11 19:41:58

@John Machin: Thanks for the pointers. I've altered the answer to quell the misinformation. But what's the justification for \Z vs $? What if there's a multiline list of filenames?

JoshD 2010-10-11 20:19:36

@JoshD: The justification for `blah\Z` in the default non-multiline mode is that `re.match("blah$", "blah\n")` will not return `None`. Multiline list: if you want to muck around emulating some 1970s Unix text editor, feel free to use re.MULTILINE mode. If you want to process data, unpack it from whatever container it's in, getting rid of (non-data) delimiters like `\n`. Then validate your data.

John Machin 2010-10-11 21:29:35
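
The difference between `$` and `\Z` described in the comment above can be checked directly. This is a minimal sketch; the filename `movie.avi` is a hypothetical sample, not something from the question.

```python
import re

# In the default (non-multiline) mode, $ also matches just before a
# trailing newline, so this does not return None.
dollar = re.match(r"blah$", "blah\n")
print(dollar is not None)      # True

# \Z matches only at the very end of the string, so the newline blocks it.
strict = re.match(r"blah\Z", "blah\n")
print(strict is None)          # True

# Stripping the delimiter first, as the comment suggests, lets \Z succeed.
line = "movie.avi\n"
m = re.search(r".\.(avi|rar|zip|txt)\Z", line.rstrip("\n"), re.I)
print(m.group(1))              # avi
```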

Answer 2

+1 A:

Short interactive run:

>>> import re
>>> pat="(.+)\.(avi|rar|zip|txt)"
>>> re.search(pat, "abcdefg.zip", re.IGNORECASE).groups()
('abcdefg', 'zip')
>>> re.search(pat, "abcdefg.ZIP", re.IGNORECASE).groups()
('abcdefg', 'ZIP')
>>>

gimel 2010-10-11 18:02:40

In this particular case it's a non-issue, but it is recommended that regex literals be raw strings, to avoid double escaping. Use `r"(.+)\.(avi|rar|zip|txt)"`.

TokenMacGuy 2010-10-11 23:30:43
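
A short sketch of why raw strings matter. `\.` survives in a normal string only because it is not a recognised string escape; `\b` is one, so it behaves differently. The sample text is invented for illustration.

```python
import re

# "\b" in a normal string literal is one character: a backspace (\x08).
# r"\b" is two characters, which the re module reads as a word boundary.
print(len("\b"), len(r"\b"))          # 1 2

text = "backup.zip is here"

plain = re.search("\bzip\b", text)    # looks for backspace + "zip" + backspace
raw = re.search(r"\bzip\b", text)     # looks for "zip" as a whole word

print(plain)                          # None
print(raw.group(0))                   # zip
```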

Answer 3

A:

(.+)[.](avi|rar|zip|txt)

Then group 2 will be the extension.

I have just written a blog post about regular expressions, http://blogs.appframe.com/erikv/2010-09-23-Regular-Expression, if you want to read more about this.

sv88erik 2010-10-11 18:04:23

Answer 4

A:

Since I think regex is evil...

def return_extension(filename):
    '''(This function assumes that filenames such as `.foo` have extension
    `foo`.)
    '''
    tokens = filename.split('.')

    return '' if len(tokens) == 1 else tokens[-1]

...I advocate simply parsing the filename.

Beau Martínez 2010-10-11 18:22:03

Reinventing the wheel but not reinventing the axle is even more evil.

John Machin 2010-10-11 19:57:32

Answer 5

A:

If you know that the extension is at the very end of the string, this should work well:

.\.(avi|rar|zip|txt)$

The first bit will ensure that there's some character before the dot.
The $ specifies that the file extension is at the end of the string, i.e. the $ means "the string ends here". For gory details on this, including some edge cases with newlines that you should be aware of, see the comment discussion on JoshD's answer, as well as the entry for $ in the docs.

So then the only entry in the match.groups() tuple, i.e. match.groups()[0], will be the extension itself.

intuited 2010-10-11 18:55:38
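
A minimal sketch of the pattern from this answer in use, showing that the single capturing group holds the extension. The sample filenames and the `results` dictionary are assumptions added for illustration.

```python
import re

pattern = re.compile(r'.\.(avi|rar|zip|txt)$', re.I)

results = {}
for name in ["abcdefg.zip", "abcdefg.ZIP", ".txt", "notes.doc"]:
    match = pattern.search(name)
    # Only one capturing group, so groups()[0] is the extension.
    results[name] = match.groups()[0] if match else None

print(results["abcdefg.zip"])   # zip
print(results["abcdefg.ZIP"])   # ZIP
print(results[".txt"])          # None (no character before the dot)
print(results["notes.doc"])     # None (extension not in the list)
```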

@intuited: -1. s/some edge cases/FAIL/

John Machin 2010-10-12 01:28:05

@John Machin: Crap, really? I can't think of any. What's an example?

intuited 2010-10-12 01:41:29

@intuited: """The justification for blah\Z in the default non-multiline mode is that re.match("blah$", "blah\n") will not return None"""

John Machin 2010-10-12 02:22:47

@John Machin: I think you need to re-read my answer, specifically the caveat that "you know that the extension is at the *very end* of the string". This is a pretty common use case (e.g. you've read in and done `split('\n')` on a file listing from a file or pipeline), so it seems worth giving a specific solution for it. In this case I think it's actually better to use the `$` because it's compatible with `fileinput.input()` without having to `rstrip` the lines first.

intuited 2010-10-12 04:20:23

@intuited: I did read that first line, twice, and decided twice not to take issue with it. Third time unlucky: How can one KNOW that the extension is at the very end of the string? In any case, whether you think that you know or not, `\Z` does the job reliably. Another way of looking at it is that `$` is a perlish substitute for `\n?\Z` ... fileinput.input()? oh, yeah, I remember, Python crutch for awk tragics -- I stopped using it some time in 1998.

John Machin 2010-10-12 05:45:15

intuited 2010-10-12 06:00:23

Answer 6

+7 A:

the standard library is better ;)

>>> import os
>>> os.path.splitext('hello.py')
('hello', '.py')

Ant 2010-10-11 19:09:39

+1 this is the right tool for the job!

katrielalex 2010-10-11 19:33:13
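
One possible way to combine `os.path.splitext` with the extension list from the regex answers. The helper name `allowed_extension` and the `ALLOWED` set are hypothetical, introduced only for this sketch.

```python
import os

# The extensions used in the regex answers, with the leading dot
# because that is how os.path.splitext returns them.
ALLOWED = {'.avi', '.rar', '.zip', '.txt'}

def allowed_extension(filename):
    """Return the extension without the dot if it is in ALLOWED, else None."""
    ext = os.path.splitext(filename)[1]
    # Compare case-insensitively but return the extension as written.
    return ext[1:] if ext.lower() in ALLOWED else None

print(allowed_extension('abcdefg.ZIP'))                 # ZIP
print(allowed_extension('a.rare.avian.creature.txt'))   # txt
print(allowed_extension('hello.py'))                    # None
print(allowed_extension('.txt'))                        # None
```

Note that `splitext` treats a leading dot as part of the name, so `.txt` on its own has no extension, which matches the "some character before the dot" requirement in the regex answers.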
