I have text files formatted as such:

R156484COMP_004A7001_20100104_065119.txt

I need to consistently extract the R****COMP part, the 004A7001 serial number, and the 20100104 date; I don't care about the 065119 number. The problem is that not all of the files being parsed follow this exact naming convention. Some may look like this:

R168166CRIT_156B2075_SU2_20091223_123456.txt

or

R285476COMP_SU1_125A6025_20100407_123456.txt

So how could I use a regex instead of Split to make sure I always get the serial (e.g. 004A7001), the date (e.g. 20100104), and the R****COMP (or CRIT) part?

Here is what I do now, but it only handles files formatted like my first example.

if (file.Count(c => c == '_') != 3) continue;

and further down in the code I have:

string RNumber = Path.GetFileNameWithoutExtension(file);

string RNumberE = RNumber.Split('_')[0];

string RNumberD = RNumber.Split('_')[1];

string RNumberDate = RNumber.Split('_')[2];

DateTime dateTime = DateTime.ParseExact(RNumberDate, "yyyyMMdd", Thread.CurrentThread.CurrentCulture);
string cmmDate = dateTime.ToString("dd-MMM-yyyy");

UPDATE: This is where I am now. I get an error when I try to parse RNumberDate into an actual date: "Cannot implicitly convert type 'RegularExpressions.Match' to 'string'".

 string RNumber = Path.GetFileNameWithoutExtension(file);

 Match RNumberE = Regex.Match(RNumber, @"^(R|L)\d{6}(COMP|CRIT|TEST|SU[1-9])(?=_)", RegexOptions.IgnoreCase);

 Match RNumberD = Regex.Match(RNumber, @"(?<=_)\d{3}[A-Z]\d{4}(?=_)", RegexOptions.IgnoreCase);
 Match RNumberDate = Regex.Match(RNumber, @"(?<=_)\d{8}(?=_)", RegexOptions.IgnoreCase);



DateTime dateTime = DateTime.ParseExact(RNumberDate, "yyyyMMdd", Thread.CurrentThread.CurrentCulture);
string cmmDate = dateTime.ToString("dd-MMM-yyyy");
+1  A: 

I don't completely understand the rules for parsing your string, but advice that might help is:

Have a look at Regex.Split and Regex.Matches to break your string up using a regex.

To create your regex, I suggest an excellent regex builder/checker/tutorial. With that tool, you can enter a bunch of strings in the big text area (e.g. your serial numbers or whatever they are) and interactively enter your regex, seeing which parts currently match. There's a "tutorial" on the right side of the page that will assist you in learning how to build the regex.

Eric J.
+2  A: 

You can use the power of multiple regular expressions to solve this problem.

compNumber:   /^R\d{6}(COMP|CRIT)(?=_)/
date:         /(?<=_)\d{8}(?=_)/
serialNumber: /(?<=_)\d{3}[A-Z]\d{4}(?=_)/

part:         /(?<=_).*?(?=_)/

Run each regular expression on the string separately to pull out the parts.
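As a rough C# sketch of running the expressions separately (the `FileNameParts` class and `TryParse` method are hypothetical names made up for illustration, not anything from the question), each pattern is applied on its own and the file is reported as unusable when any part is missing, for example a name with no serial number:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper; class and method names are illustrative only.
static class FileNameParts
{
    static readonly Regex CompNumber =
        new Regex(@"^R\d{6}(COMP|CRIT)(?=_)", RegexOptions.IgnoreCase);
    static readonly Regex Serial =
        new Regex(@"(?<=_)\d{3}[A-Z]\d{4}(?=_)", RegexOptions.IgnoreCase);
    static readonly Regex Date =
        new Regex(@"(?<=_)\d{8}(?=_)");

    // Returns false when any of the three parts is missing,
    // so the caller can skip that file.
    public static bool TryParse(string path, out string compNumber,
                                out string serial, out string date)
    {
        string name = Path.GetFileNameWithoutExtension(path);

        // Regex.Match only returns the first match of each pattern.
        Match c = CompNumber.Match(name);
        Match s = Serial.Match(name);
        Match d = Date.Match(name);

        // Value is an empty string for a failed match.
        compNumber = c.Value;
        serial = s.Value;
        date = d.Value;
        return c.Success && s.Success && d.Success;
    }
}
```

In the file loop, something like `if (!FileNameParts.TryParse(file, out e, out d, out date)) continue;` could then take the place of the underscore count check.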

strager
Using the regex builder that Eric J. posted, these look like perfect expressions. One thing though: in a few cases there are files formatted like R######COMP_TEST_20100103_123456.txt, which have no serial number. How can I tell the code to skip a file in that case?
jakesankey
Shouldn't I be able to do something like this, so that if there isn't a serial it just returns whatever is right after the first '_'? `(?<=_)\d{3}[A-Z]\d{4}(?=_)|(^_)`
jakesankey
@jakesankey, I think you should do that in your C# code, not in the regular expression. It's relatively simple: if the `serialNumber` regexp doesn't match, run a different regexp. I've updated my answer with an expression which may help, though a string split works just as well.
strager
@Kobi, Testing in RegexBuddy, `\b` cannot replace the look-arounds.
strager
@strager - that bit of code for the 'part' returns many strings from the filename. It finds each string after a '_'. How can I get it to return only the bit between the first and second '_'?
jakesankey
@jakesankey, Only look at the first match, of course.
strager
I am getting an error when parsing the date. Please see the updated code block in my original question.
jakesankey
@jakesankey, `Regex.Match` returns a [`Match`](http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.match.aspx) and not a string. To get the string of the match, call `Match.Value`. I.e., `DateTime dateTime = DateTime.ParseExact(RNumberDate.Value, "yyyyMMdd", Thread.CurrentThread.CurrentCulture);`
strager
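To spell that out a little, here is a minimal sketch of the date step using `Match.Value`. The `DateFromName` class and `FormatDate` method are hypothetical names, and it swaps in `CultureInfo.InvariantCulture` and `TryParseExact` in place of the thread culture and `ParseExact` from the question, which is an assumption on my part so that the month abbreviation does not depend on the machine's locale and a bad date does not throw:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
static class DateFromName
{
    // Returns the date as dd-MMM-yyyy, or null when the name
    // contains no valid 8-digit date between underscores.
    public static string FormatDate(string name)
    {
        Match m = Regex.Match(name, @"(?<=_)\d{8}(?=_)");
        if (!m.Success) return null;

        // m is a Match object; m.Value is the matched text as a string.
        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            m.Value, "yyyyMMdd", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out parsed);
        if (!ok) return null;

        return parsed.ToString("dd-MMM-yyyy", CultureInfo.InvariantCulture);
    }
}
```

Checking `Success` before reading `Value` matters because a failed match has an empty `Value`, which would otherwise make the date parse fail.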
:) Thanks for all the help!
jakesankey
+1  A: 
string filename = "R285476COMP_SU1_125A6025_20100407_123456.txt";

Match m = Regex.Match(filename,
    @"^(R\d+(?:COMP|CRIT))_(?:SU\d+_)?(\d+[A-Z]+\d+)_(?:SU\d+_)?(\d{8})_.*$",
    RegexOptions.IgnoreCase);

if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value);    // R285476COMP
    Console.WriteLine(m.Groups[2].Value);    // 125A6025
    Console.WriteLine(m.Groups[3].Value);    // 20100407
}
LukeH
It may be beneficial to name your groups so you can move things around in the expression without having to change the C# code.
strager
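For what it's worth, a sketch of what the named-group version might look like. The pattern is the one from the answer above with `(?<name>...)` added; the group names `comp`, `serial`, and `date` and the `NamedGroupsDemo` wrapper are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper; class, method, and group names are illustrative only.
static class NamedGroupsDemo
{
    // Same pattern as in the answer, with the three capturing groups named.
    static readonly Regex Pattern = new Regex(
        @"^(?<comp>R\d+(?:COMP|CRIT))_(?:SU\d+_)?(?<serial>\d+[A-Z]+\d+)_(?:SU\d+_)?(?<date>\d{8})_.*$",
        RegexOptions.IgnoreCase);

    // Returns { comp, serial, date }, or null when the name does not match.
    public static string[] Parse(string filename)
    {
        Match m = Pattern.Match(filename);
        if (!m.Success) return null;

        // Groups are looked up by name, so reordering the pattern
        // does not change this code.
        return new[]
        {
            m.Groups["comp"].Value,
            m.Groups["serial"].Value,
            m.Groups["date"].Value
        };
    }
}
```

Because the lookups use names instead of positions, adding or moving a group in the expression later does not shift any indexes in the calling code.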
+1 I like this for the answer, with the enhancement strager suggested.
Chris Taylor