I have a bunch of files that need to be parsed, and they all have one of two date patterns in the file name (we're upgrading our system, and the file parser needs to recognize both the old and the new date format).

The filenames look like either <fileroot>_yyyyMMdd.log or <fileroot>_MMddyy.log, and I need to extract the digits so that I can parse the dates. However, whenever I use a regular expression like ^.*(\\d{6,8}).*$ or ^.*(\\d{6}|\\d{8}).*$, the capture group is always 6 characters long, even for file names that contain 8 digits.

Is there any way to force the regular expression library in C# to be as exhaustive as possible when matching a regular expression? I know how to do it in Java, just not in C# / .NET; I'm pretty new to the language.

+1  A: 

If you know that the date is always followed by a known string, I'd change the regex to force matching that string:

^.*(\\d{6,8})\.log$

This will force the regex engine to consume all 8 digits in order to match the trailing \.log.

JSBangs
Tried it, doesn't work. The (apparently default) lazy matching of the .NET regex engine gets '091117' when compared against fileroot_20091117.log using that regex.
Alex Marshall
+2  A: 

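The problem is the leading ".*". Regex quantifiers are greedy by default, so ".*" matches as many characters as it can, including the first two digits of an 8-digit date. The engine only backtracks far enough for \\d{6,8} to succeed, which leaves the group with just 6 digits.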
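To see the backtracking concretely, here is a small sketch that adds a second capture group around the leading ".*" so you can inspect what it swallowed. The class and method names (GreedyDemo, Split) are made up for illustration; the pattern is the one from the question with that extra group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns what the greedy ".*" consumed (group 1) and what was left
    // for the digit group (group 2). Names here are illustrative only.
    public static (string Prefix, string Digits) Split(string fileName)
    {
        Match m = Regex.Match(fileName, @"^(.*)(\d{6,8}).*$");
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    public static void Main()
    {
        var (prefix, digits) = Split("fileroot_20091117.log");
        Console.WriteLine($"prefix = '{prefix}', digits = '{digits}'");
        // prints: prefix = 'fileroot_20', digits = '091117'
    }
}
```

The "20" ends up in the prefix because ".*" gives characters back one at a time and stops as soon as six digits are available.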

Solutions:

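1) .*_(\\d{6,8}) - if you always have _ before the digits

2) .*[^\\d](\\d{6,8}) - if the digits are always preceded by some non-digit character

3) .*?(\\d{6,8}) - makes the leading .* lazy, so it stops at the first position where 6 to 8 digits can match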

You would have the same problem in Java; quantifiers are greedy by default there too.
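For what it's worth, here is a rough sketch of how solution 1 might be wired up in C# to go all the way from file name to DateTime. It assumes the file names always end in _<digits>.log as in the question; the names LogFileDate and TryGetDate are hypothetical. I've written the group as \d{8}|\d{6} instead of \d{6,8} so that a 7-digit run is rejected, which is my own tweak and not required for the fix.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LogFileDate
{
    // Anchor the digits between the literal '_' and the ".log" extension,
    // so no greedy ".*" can steal leading digits from the group.
    private static readonly Regex DatePattern =
        new Regex(@"_(\d{8}|\d{6})\.log$");

    // Hypothetical helper: returns false if the name has no recognizable date.
    public static bool TryGetDate(string fileName, out DateTime date)
    {
        date = default(DateTime);
        Match m = DatePattern.Match(fileName);
        if (!m.Success) return false;

        string digits = m.Groups[1].Value;
        // 8 digits: new format yyyyMMdd; 6 digits: old format MMddyy.
        string format = digits.Length == 8 ? "yyyyMMdd" : "MMddyy";
        return DateTime.TryParseExact(digits, format,
            CultureInfo.InvariantCulture, DateTimeStyles.None, out date);
    }

    public static void Main()
    {
        foreach (string name in new[] { "fileroot_20091117.log", "fileroot_111709.log" })
        {
            Console.WriteLine(TryGetDate(name, out DateTime d)
                ? $"{name} -> {d:yyyy-MM-dd}"
                : $"{name} -> no date found");
        }
    }
}
```

Choosing the format by the length of the captured digits keeps the regex simple and leaves the actual date validation to TryParseExact.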

yu_sha
#1 did the trick, thank you very much for your help.
Alex Marshall