How can I fix this RegEx to optionally capture a file extension?
I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)
My RegEx (.NET Flavor) is as follows:
.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.* # Ignore some garbage in the front
(header_ # Match the start of the file name,
\d{10,11}_) # including the ID (10 - 11 digits)
.* # Ignore the type code in the middle
(_.*_\d{8}) # Match some random characters, then an 8-digit date
.* # Ignore anything between this and the file extension
(\.\w{3,4}) # Match the file extension, 3 or 4 characters long
.* # Ignore the rest of the string
I expect this to match strings like:
str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"
Where the capture groups return something like:
$1 = header_0000000602_
$2 = _mc2e1nrobr1a3s55niyrrqvy_20081212
$3 = .doc
Where $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.
If I add "?" to the end of the third capture group "(.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead "(.\w{3,4})+", the RegEx no longer captures str3 at all, which is to be expected.
I feel that using "?" at the end of the third capture group is the appropriate thing to do, but it doesn't work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.
Doesn't Work As Expected:
.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*