How can I fix this RegEx to optionally capture a file extension?

I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)


My RegEx (.NET Flavor) is as follows:

.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.*                   # Ignore some garbage in the front
(header_             # Match the start of the file name,
    \d{10,11}_)      #     including the ID (10 - 11 digits)
.*                   # Ignore the type code in the middle
(_.*_\d{8})          # Match some random characters, then an 8-digit date
.*                   # Ignore anything between this and the file extension
(\.\w{3,4})          # Match the file extension, 3 or 4 characters long
.*                   # Ignore the rest of the string


I expect this to match strings like:

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


Where the capture groups return something like:

$1  =  header_0000000602_
$2  =  _mc2e1nrobr1a3s55niyrrqvy_20081212
$3  =  .doc


Where $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.

If I add "?" to the end of the third capture group, "(\.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead, "(\.\w{3,4})+", the RegEx no longer matches str3 at all, which is to be expected.

I feel that using "?" at the end of the third capture group is the appropriate thing to do, but it doesn't work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.


Doesn't Work As Expected:

.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*
+2  A: 

One possibility is that the second to last .* is being greedy. You might try changing it to:

.*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*
                             ^ Added that

That wasn't correct. This one will match the input you supplied, but it assumes that the first . it encounters after the date is the start of a file extension:

.*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*
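
For anyone wondering why the lazy version changes nothing, here is a small sketch. It uses Python's re module rather than .NET (an assumption made only so the snippet is easy to run; the quantifiers involved should behave the same way in both), and the variable names are just illustrative.

```python
import re

s = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"

# Lazy attempt: .*? is happy to match nothing, the optional group then
# fails at "[" and is skipped, and the trailing .* takes the rest.
lazy = re.compile(r".*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*")

# Negated class: [^\.]* has to stop at the first ".", so the optional
# group gets its chance exactly where the extension starts.
negated = re.compile(r".*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*")

lazy_ext = lazy.match(s).group(3)
negated_ext = negated.match(s).group(3)
print(lazy_ext, negated_ext)
```

Both patterns match the whole string; only the second one leaves the extension in group 3.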

Edit: Remove the escaping I had in the second regex.

Sean Bright
Thanks for the suggestion, but that doesn't change my results.
EndangeredMassa
Good one, Sean! Here's the pattern I had come up with: .*?(header_\d*_).*?(_\d{8})[^\.]*(\.[a-zA-Z0-9]{3,4})?
Cerebrus
@Sean: Your edit works great. Thanks!
EndangeredMassa
@Cerebrus: Your answer works too. You should post that as an answer instead of a comment.
EndangeredMassa
+1  A: 

Well, .* is probably the wrong way to start the regex. It matches 0 or more (*) of any single character (.), so it first consumes your entire string and then has to backtrack until the rest of the pattern fits. If you leave it off, the regex will start matching when it reaches header, which is what you want. You could also replace it with \b, which matches a word boundary (\w, by contrast, matches a single word character). I also suggest using a tool such as The Regex Coach so you can step through it and see exactly what's wrong and what your capture groups will be.
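
A rough illustration of the point about the leading .*, sketched in Python's re module instead of .NET (an assumption for runnability; re.search scans for a match the way an unanchored Regex.Match does). The variable names are made up for the example.

```python
import re

s = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"

# With a leading .*, the overall match starts at index 0 and includes the prefix.
with_prefix = re.search(r".*(header_\d{10,11}_)", s)

# Without it, the search simply starts the match where "header_" is found.
without_prefix = re.search(r"(header_\d{10,11}_)", s)

# \b asserts a word boundary and consumes nothing; \w would consume a character.
with_boundary = re.search(r"\b(header_\d{10,11}_)", s)

print(with_prefix.start(), without_prefix.start(), with_boundary.start())
```

Group 1 is the same in all three cases; only the extent of the overall match differs.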

Factor Mystic
If I replace the first ".*" with "\w?", I get the same results. But, thanks for the suggestion. I'll use "\w?" to be more clear in my RegEx.
EndangeredMassa
Thanks for the mention of Regex Coach. That thing is awesome.
EndangeredMassa
A: 

Which regex implementation are you using?

klyde
I am using .NET. I should have mentioned it in the body, but I did tag it as such. *Edited to include the mention of .NET RegEx.
EndangeredMassa
+2  A: 

I believe the problem is in your 3rd .*, which you annotated above with "Ignore anything between this and the file extension". It's greedy, so it will match ANYTHING. When you make the extension pattern optional, the 3rd .* matches up to the end of the string, which is allowed. Assuming that there will NEVER be a '.' character in that extraneous bit, you can replace .* with [^.]* and the rest will hopefully work after you restore the ? that you had to remove.
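
To make the difference concrete, here is a minimal sketch that runs both variants against the three sample strings from the question. It is written for Python's re module, not .NET (an assumption for the sake of a runnable snippet; greedy quantifiers and negated classes should behave the same in both flavors).

```python
import re

samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

# Third .* is greedy: it runs to the end of the string, so the optional
# extension group is left with nothing to match.
greedy = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4})?.*")

# [^.]* cannot cross a period, so it stops right before the extension.
fixed = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?.*")

greedy_exts = [greedy.match(s).group(3) for s in samples]
fixed_exts = [fixed.match(s).group(3) for s in samples]
print(greedy_exts)
print(fixed_exts)
```

The greedy variant still matches every string, which is why the problem is easy to miss: the match succeeds, but group 3 never participates.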

Eddie
That does work. I think I can assume that a period will never come before the file extension, in this case. Thanks!
EndangeredMassa
A: 

This works for the examples you've posted:

^.*?(?<header>\d+)_.*?_(?<date>\d{8}).*?(?:\.(?<ext>\w{3,4}))?[\w\s\[\]]*$

I'm assuming that the text "header" and the random characters between that and the date aren't important, so those aren't captured by this regex. I also used the .NET named capture feature for clarity, but be aware that some other RegEx flavors don't support it or use a different syntax for it.

If the text after the file name contains anything other than word characters, whitespace, [ and ], the pattern will need to be revised.
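
As an example of the syntax difference between flavors, here is the same pattern sketched for Python's re module, which spells named groups (?P<name>...) instead of (?<name>...). Everything else is unchanged; the sample strings are the ones from the question.

```python
import re

# Same pattern as above, with (?<name>...) rewritten as (?P<name>...),
# which is the spelling Python's re module expects.
pattern = re.compile(
    r"^.*?(?P<header>\d+)_.*?_(?P<date>\d{8}).*?(?:\.(?P<ext>\w{3,4}))?[\w\s\[\]]*$"
)

samples = [
    "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]",
    "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt",
    "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]",
]

# groupdict() returns the named groups; a group that did not take part is None.
results = [pattern.match(s).groupdict() for s in samples]
for r in results:
    print(r)
```

Note that the ext group here does not include the leading period, since the \. sits outside the named group.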

Daniel Schaffer
+1  A: 

After your second capture group, match only characters that are not a period, then match your extension.

".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?"
David Morton
That works. Thanks!
EndangeredMassa
A: 

Here is one that works for what you're posting:

^.*(?<header>header_\d{10,11})_.*(?<date>_[a-z0-9]+_\d{8})(\[\d+\])(?<ext>(\.[a-zA-Z0-9]{3,4})?).*

The replacement is:

Header: $1
Date: $2
Extension: $4

I didn't use the named groups in the replacement because I couldn't figure out how to get TextMate to do it, but the named groups were helpful to force the capture.
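
One thing to watch when mixing named and unnamed groups is that the numbering depends on the flavor. The sketch below uses Python's re module (an assumption, chosen because it is easy to run), where every group is numbered left to right, so the extension does end up in group 4. As far as I know, .NET by default numbers the unnamed groups first and the named ones after them, so $4 may refer to something else there. Referring to groups by name avoids the question.

```python
import re

# The pattern above with (?<name>...) rewritten as (?P<name>...) for Python.
pattern = re.compile(
    r"^.*(?P<header>header_\d{10,11})_.*(?P<date>_[a-z0-9]+_\d{8})"
    r"(\[\d+\])(?P<ext>(\.[a-zA-Z0-9]{3,4})?).*"
)

s = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
m = pattern.match(s)

# Python numbers every group left to right, named or not.
numbering = dict(pattern.groupindex)

# Referring to groups by name sidesteps the numbering question entirely.
summary = m.expand(r"Header: \g<header> | Date: \g<date> | Extension: \g<ext>")
print(numbering)
print(summary)
```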

scott
+1  A: 

This should give you the result you want:

.*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*
-------------------------------------------
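.*?                  # Lazily skip any garbage in the front
(header_             # Match the start of the file name,
    \d*_)            #     including the ID
.*?                  # Lazily skip the type code in the middle
(_.*_.{8})           # Match some random characters, then the 8-character date
[^.]*                # Take everything that is NOT a period
(\.\w{3,4})?         # Optionally match the extension
.*                   # Ignore the rest of the string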

The implicit assumption is that the first period after the date is the beginning of the file extension. The following string breaks that assumption: it still matches, but $3 comes back as .foob rather than .txt:

string problematic = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt";
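
A quick check of what the pattern actually does with that string, sketched in Python's re module rather than .NET (an assumption for runnability; the constructs used should behave the same way in both):

```python
import re

pattern = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")

s = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"
m = pattern.match(s)

# [^.]* stops at the first period, and \w{3,4} then takes at most four
# word characters, so the captured "extension" is a fragment of "foobar".
captured = m.group(3)
print(captured)
```

So the pattern does not reject such a string; it quietly captures the wrong thing, which is arguably worse.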

Also, when taking out your groups in .NET make sure your code looks like this:

regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value
regex.Match(string_to_match).Groups[3].Value

and not this:

// Groups[0] is the entire match, not the first capture group
regex.Match(string_to_match).Groups[0].Value
regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value

This is something that tripped me up at first.
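
The same convention shows up in other regex libraries too. As a small sketch in Python's re module (not .NET, so treat the API names as the Python analogue only), group(0) is the entire match and the capture groups start at 1:

```python
import re

pattern = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")
s = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
m = pattern.match(s)

# Index 0 is the whole match (here the full input, because of the .* at both ends).
whole = m.group(0)

# The capture groups from the pattern live at indexes 1 to 3.
captures = [m.group(i) for i in (1, 2, 3)]
print(whole)
print(captures)
```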

Gavin Miller
Thanks for the note about the Groups index. I noticed that when I started this.
EndangeredMassa