I'm trying to handle a bunch of files, and I need to alter them to remove extraneous information in the filenames; notably, I'm trying to remove text inside parentheses. For example:

filename = "Example_file_(extra_descriptor).ext"

and I want to regex a whole bunch of files where the parenthetical expression might be in the middle or at the end, and of variable length.

What would the regex look like? Perl or Python syntax would be preferred.

+7  A: 
s/\(.*?\)//

So in Python, you'd do:

re.sub('\(.*?\)', '', filename)
Can Berk Güder
Though `'\(.*?\)'` works it is safer to use `r'\(.*?\)'` in general.
J.F. Sebastian
is there any reason to prefer .*? over [^)]*
Kip
@J.F. Sebastian: you're right.
Can Berk Güder
@Kip: nope. I don't know why, but .* is always the first thing that comes to mind.
Can Berk Güder
@Kip: .*? is not handled by all regex parsers, whereas your [^)]* is handled by almost all of them.
X-Istence
@Kip: Another reason is backtracking.
Gumbo
.* gets everything between the first left paren and last right paren: 'a(b)c(d)e' will become 'ae'. [^)]* only removes between the first left paren and the first right paren: 'ac(d)e'. You'll also get different behaviors for nested parens.
daotoad
Oops, I was wrong in the last comment. The '?' in the .* example makes it behave like the negated character class. But then, so was Gumbo, since there won't be any backtracking with the non-greedy .*? construct.
daotoad
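To make the comparison in the comments above concrete, here is a small Python sketch. The sample strings come from this thread; the variable names are illustrative only. It shows how greedy `.*`, non-greedy `.*?`, and the negated class `[^)]*` behave on flat and nested input:

```python
import re

s = 'a(b)c(d)e'

# count=1 limits each call to the first match so the spans can be compared.
greedy = re.sub(r'\(.*\)', '', s, count=1)      # first '(' through the LAST ')'
lazy = re.sub(r'\(.*?\)', '', s, count=1)       # first '(' through the first ')'
negated = re.sub(r'\([^)]*\)', '', s, count=1)  # same span as the lazy form here

print(greedy)   # ae
print(lazy)     # ac(d)e
print(negated)  # ac(d)e

# Without count=1, re.sub replaces every match, so both chunks are removed.
all_removed = re.sub(r'\([^)]*\)', '', s)
print(all_removed)  # ace

# Nested parentheses: neither form balances them.
nested = 'foo_(bar(baz)buz)).foo'
nested_greedy = re.sub(r'\(.*\)', '', nested)
nested_negated = re.sub(r'\([^)]*\)', '', nested)
print(nested_greedy)   # foo_.foo
print(nested_negated)  # foo_buz)).foo
```

On flat input the lazy and negated-class forms give the same result; they only differ from the greedy form, and all three mishandle nesting.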
+1  A: 

If you can stand to use sed (possibly executed from within your program), it'd be as simple as:

sed 's/(.*)//g'

samoz
You are just grouping the expression `.*`.
Gumbo
@Gumbo: No, he's not. In sed, "\(...\)" groups.
runrig
Oops, sorry. Didn’t know that.
Gumbo
+8  A: 

I would use:

\([^)]*\)
Gumbo
+2  A: 

If a path may contain parentheses then the r'\(.*?\)' regex is not enough:

import os, re

def remove_parenthesized_chunks(path, safeext=True, safedir=True):
    dirpath, basename = os.path.split(path) if safedir else ('', path)
    name, ext = os.path.splitext(basename) if safeext else (basename, '')
    name = re.sub(r'\(.*?\)', '', name)
    return os.path.join(dirpath, name+ext)

By default the function preserves parenthesized chunks in the directory and extension parts of the path.

Example:

>>> f = remove_parenthesized_chunks
>>> f("Example_file_(extra_descriptor).ext")
'Example_file_.ext'
>>> path = r"c:\dir_(important)\example(extra).ext(untouchable)"
>>> f(path)
'c:\\dir_(important)\\example.ext(untouchable)'
>>> f(path, safeext=False)
'c:\\dir_(important)\\example.ext'
>>> f(path, safedir=False)
'c:\\dir_\\example.ext(untouchable)'
>>> f(path, False, False)
'c:\\dir_\\example.ext'
>>> f(r"c:\(extra)\example(extra).ext", safedir=False)
'c:\\\\example.ext'
J.F. Sebastian
A: 
>>> import re
>>> filename = "Example_file_(extra_descriptor).ext"
>>> p = re.compile(r'\([^)]*\)')
>>> re.sub(p, '', filename)
'Example_file_.ext'
Selinap
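Since the question is about a whole batch of files, here is a rough sketch of applying the same pattern to a list of names before renaming anything. `plan_renames` and `PAREN_CHUNK` are hypothetical names made up for this example, and it assumes the parenthesized text never needs to survive in the extension, so the extension is split off first with `os.path.splitext`:

```python
import os
import re

PAREN_CHUNK = re.compile(r'\([^)]*\)')

def plan_renames(filenames):
    """Return (old, new) pairs for the names that would change.

    Hypothetical helper: it only computes names and never touches the disk.
    """
    plan = []
    for old in filenames:
        stem, ext = os.path.splitext(old)        # keep the extension untouched
        new = PAREN_CHUNK.sub('', stem) + ext    # strip every (...) chunk from the stem
        if new != old:
            plan.append((old, new))
    return plan

names = ["Example_file_(extra_descriptor).ext", "plain.txt", "a(b)c(d).dat"]
for old, new in plan_renames(names):
    print(old, '->', new)
```

Once the printed plan looks right, each pair could be passed to `os.rename()`. Checking the plan first is a design choice of this sketch, because two different files can end up with the same name after their parentheticals are removed.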
+1  A: 

If you don't absolutely need to use a regex, consider using Perl's Text::Balanced to remove the parentheses.

use Text::Balanced qw(extract_bracketed);

my ($extracted, $remainder, $prefix) = extract_bracketed( $filename, '()', '[^(]*' );

{   no warnings 'uninitialized';

    $filename = (defined $prefix or defined $remainder)
                ? $prefix . $remainder
                : $extracted;
}

You may be thinking, "Why do all this when a regex does the trick in one line?"

$filename =~ s/\([^)]*\)//;

Text::Balanced handles nested parentheses. So $filename = 'foo_(bar(baz)buz)).foo' will be extracted properly. The regex-based solutions offered here will fail on this string: the negated-class version stops at the first closing paren, and the greedy version eats everything up to the last one.

$filename =~ s/\([^)]*\)//; # returns 'foo_buz)).foo'

$filename =~ s/\(.*\)//; # returns 'foo_.foo'

# text balanced example returns 'foo_).foo'

If either of the regex behaviors is acceptable, use a regex--but document the limitations and the assumptions being made.
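For readers working in Python, where Text::Balanced is not available, the nesting behavior can be approximated with a small hand-written scanner. This is only a sketch: `strip_balanced_parens` is a hypothetical helper, not a library function, and unlike the Perl snippet above it removes every balanced chunk rather than just the first one.

```python
def strip_balanced_parens(text):
    """Remove every balanced (...) chunk, including nested ones.

    Hypothetical helper. Unmatched '(' or ')' characters are left in place.
    """
    out = []    # characters kept so far
    opens = []  # for each currently open '(', its index in `out`
    for ch in text:
        if ch == '(':
            opens.append(len(out))
            out.append(ch)
        elif ch == ')' and opens:
            # Discard everything back to, and including, the matching '('.
            del out[opens.pop():]
        else:
            out.append(ch)
    return ''.join(out)

print(strip_balanced_parens('foo_(bar(baz)buz)).foo'))               # foo_).foo
print(strip_balanced_parens('Example_file_(extra_descriptor).ext'))  # Example_file_.ext
```

On the nested example it yields the same 'foo_).foo' as the Text::Balanced version, because the stray closing paren has no partner and is kept.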

daotoad
While I know you can't parse nested parentheses with (classic) regexes, if you know you're never going to encounter nested parentheses, you can simplify the problem to one that CAN be done with regexes, and fairly easily. It's overkill to use a parser tool when we don't need it.
Chris Lutz
@Chris Lutz - I should have said "consider" rather than "use" in the first sentence. In many cases a regex will do the job, which is why I said to use a regex if the behavior is acceptable.
daotoad