ansaurus

Question

How can I translate the following filename to a regular expression in Python?

Answer 1

+4 A:

To avoid confusion, read the following, in order.

First, you have the glob module, which handles file name regular expressions just like the Windows and unix shells.

Second, you have the fnmatch module, which just does pattern matching using the unix shell rules.

Third, you have the re module, which is the complete set of regular expressions.

Then ask another, more specific question.

S.Lott 2008-11-21 21:14:03

Thank you very much! Will do.

Amara 2008-11-21 21:16:38

Answer 2

+1 A:

Your question is a bit unclear. You say you want a regular expression, but could it be that you want a glob-style pattern you can use with commands like ls? glob expressions and regular expressions are similar in concept but different in practice (regular expressions are considerably more powerful, glob style patterns are easier for the most common cases when looking for files.

Also, what do you consider to be the pattern? Certainly, * (glob) or .* (regex) will match the pattern. Also, *test.ext (glob) or ._test.ext (regexp) pattern would match, as would many other variations.

Can you be more specific about the pattern? For example, you might describe it as "b, followed by digits, followed by cv, followed by digits ..."

Once you can precisely explain the pattern in your native language (and that must be your first step), it's usually a fairly straight-forward task to translate that into a glob or regular expression pattern.

Bryan Oakley 2008-11-21 21:33:58

Answer 3

A:

if the letters are unimportant, you could try \w\d\d\d\w\w\d\d_test.ext which would match the letter/number pattern, or b\d\d\dcv\d\d_test.ext or some mix of the two.

Awateru 2008-11-21 23:07:05

Answer 4

A:

When working with regexes I find the Mochikit regex example to be a great help.

/^b\d\d\dcv\d\d_test\.ext$/

Then use the python re (regex) module to do the match. This is of course assuming regex is really what you need and not glob as the others mentioned.

Brian C. Lane 2008-11-22 05:37:30

Answer 5

+3 A:

I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by .'ext'

^b\d{3}cv\d{2}_release\.ext$

Jan Goyvaerts 2008-11-22 07:47:13

You forgot to escape the period.

titaniumdecoy 2008-11-22 07:55:47

Fixed That For You

S.Lott 2008-11-22 14:32:52

Answer 6

A:

b[0-9]{3}cv[0-9]{2}_release\.ext

titaniumdecoy 2008-11-22 07:53:32

Answer 7

+10 A:

Now that you have a human readable description of your file name, it's quite straight forward to translate it into a regular expression (at least in this case ;)

must start with

The caret (^) anchors a regular expression to the beginning of what you want to match, so your re has to start with this symbol.

'b',

Any non-special character in your re will match literally, so you just use "b" for this part: ^b.

followed by [...] digits,

This depends a bit on which flavor of re you use:

The most general way of expressing this is to use brackets ([]). Those mean "match any one of the characters listed within. [ASDF] for example would match either A or S or D or F, [0-9] would match anything between 0 and 9.

Your re library probably has a shortcut for "any digit". In sed and awk you could use [[:digit:]] [sic!], in python and many other languages you can use \d.

So now your re reads ^b\d.

followed by three [...]

The most simple way to express this would be to just repeat the atom three times like this: \d\d\d.

Again your language might provide a shortcut: braces ({}). Sometimes you would have to escape them with a backslash (if you are using sed or awk, read about "extended regular expressions"). They also give you a way to say "at least x, but no more than y occurances of the previous atom": {x,y}.

No you have: ^b\d{3}

followed by 'cv',

Literall matching again, now we have ^b\d{3}cv

followed by two digits,

We already covered this: ^b\d{3}cv\d{2}.

then an underscore, followed by 'release', followed by .'ext'

Again, this should all match literally, but the dot (.) is a special character. This means you have to escape it with a backslash: ^\d{3}cv\d{2}_release\.ext

Leaving out the backslash would mean that a filename like "b410cv11_test_ext" would also match, which may or may not be a problem for you.

Finally, if you want to guarantee that there is nothing else following ".ext", anchor the re to the end of the thing to match, use the dollar sign ($).

Thus the complete regular expression for your specific problem would be:

^b\d{3}cv\d{2}_release\.ext$

Easy.

Whatever language or library you use, there has to be a reference somewhere in the documentation that will show you what the exact syntax in your case should be. Once you have learned to break down the problem into a suitable description, understanding the more advanced constructs will come to you step by step.

hop 2008-11-22 11:09:25

Like the way you wrote it :-)

e-satis 2008-11-23 18:27:45

And by the way, I'd like to point to http://www.dabeaz.com/generators/Generators.pdf as a nice introduction to generators where you can learn how to create a Python replacement for "find" and "grep".

e-satis 2008-11-23 18:38:38

Thank you very much for a very clear, concise, and detailed explanation. I also liked the way you arrived at the expression - I will be using this step-wise methodology until the process becomes more second nature for me. The perfect explanation for a beginner. Thanks again!

Amara 2008-11-24 14:59:41

ansaurus

tags:

views:

answers:

How can I translate the following filename to a regular expression in Python?

related questions