I have some text data in this format:

MI
00
3

MD
1
0.0000
MD
2
0.0000
MD
3
0.0000

This block can repeat. The number of MD records per MI is variable (but always >= 1), and I need to capture the numeric values that follow each MD.

I have a regex that matches every MD per MI but it will only capture the last MD. Is it possible to capture each MD without knowing in advance how many there are?

EDIT: Per requests... Regex is below; the important part of my question remains "can I capture every MD set?"

MI\r\d\d\r(\d)\r[\s\w]{6}\r(MD\r[\s\d]{2}\r[\s\d\.\-]*\r)+

My language of choice is C# but I'd take an answer in any language because it would at least give me a start.

MD is a data point out of a sulfur detector from the early 90s.

+2  A: 

It is possible, but it will take more than one pass over the data. In most regex flavors a group holds only one chunk of information per match. So you could search with an MD group and find all your MD matches, or search with an MI group that contains an MD group and find all your MI matches, but in the second case the individual MDs would not be separated out.

One solution is nested regex calls, with the first one finding each MI group and the second one finding each MD group within the MI group.
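The nested approach could be sketched in C# roughly as follows. This is only an illustration under assumptions: the class name, group names, and both patterns are made up for this example, the filler line after the MI count is assumed to be blank, and each MD value is assumed to be a plain decimal number.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NestedRegexSketch
{
    // Outer pattern (assumed): one MI header plus everything up to the next "MI" or end of input.
    static readonly Regex MiBlock = new Regex(
        @"MI\s+(?<code>\d\d)\s+(?<count>\d+)(?<body>.*?)(?=MI|\z)",
        RegexOptions.Singleline);

    // Inner pattern (assumed): one MD record inside the body of a single MI block.
    static readonly Regex MdRecord = new Regex(
        @"MD\s+(?<index>\d+)\s+(?<value>-?\d+(?:\.\d+)?)");

    // Returns one entry per MI block: the MI code and its (index, value) MD pairs.
    public static List<(string Code, List<(int Index, double Value)> Mds)> Parse(string text)
    {
        var result = new List<(string Code, List<(int Index, double Value)> Mds)>();
        foreach (Match mi in MiBlock.Matches(text))
        {
            var mds = new List<(int Index, double Value)>();
            // Second pass: run the inner regex only over this MI block's body.
            foreach (Match md in MdRecord.Matches(mi.Groups["body"].Value))
            {
                mds.Add((
                    int.Parse(md.Groups["index"].Value, CultureInfo.InvariantCulture),
                    double.Parse(md.Groups["value"].Value, CultureInfo.InvariantCulture)));
            }
            result.Add((mi.Groups["code"].Value, mds));
        }
        return result;
    }
}
```

Because the inner regex only ever sees one MI block's text, every MD it finds is automatically tied to the right MI.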

Brian
A: 

I think this will do it. At least it works with RegexBuddy using Perl.

MD[^MI]*

Data just repeated from above.

EDIT: This seems to capture all MD and the initial MI in its own little block.

MI([^MI]*(MD[^MI]*)*)
Keng
How would you handle the grouping?
Austin Salonen
I guess I don't understand what you mean by grouping. Do you need to tie each MD with the specific MI?
Keng
A: 

I'm not an expert in C#, but in Java, you'd want to change (MD...)+ to ((MD...)+). That way, you can use the outer pair of parentheses to capture all MDs.

Adam Crume
A: 

I would recommend you implement a state machine for this task.

But here is a regex I think will also work:

MI\r\d\d\r(\d)\r\r(MD\r\d\r[0-9\.]+\r?)*
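For comparison, the state machine suggested above might look something like this in C#. It is a rough sketch, not a tested parser for the real detector output: the class, enum, and state names are invented for the example, values are kept as strings, and any unexpected line between records is simply skipped.

```csharp
using System;
using System.Collections.Generic;

public static class StateMachineSketch
{
    // What the next non-blank line is expected to be.
    enum State { ExpectTag, MiCode, MiCount, MdIndex, MdValue }

    // Returns a flat list of (MI code, MD index, MD value) rows.
    public static List<(string MiCode, string MdIndex, string MdValue)> Parse(string text)
    {
        var rows = new List<(string MiCode, string MdIndex, string MdValue)>();
        var state = State.ExpectTag;
        string miCode = null, mdIndex = null;

        var lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line.Length == 0) continue; // whitespace-only filler line

            switch (state)
            {
                case State.ExpectTag:
                    if (line == "MI") state = State.MiCode;
                    else if (line == "MD" && miCode != null) state = State.MdIndex;
                    // Anything else is ignored in this sketch.
                    break;
                case State.MiCode:
                    miCode = line;
                    state = State.MiCount;
                    break;
                case State.MiCount:
                    // The count is consumed but not used here.
                    state = State.ExpectTag;
                    break;
                case State.MdIndex:
                    mdIndex = line;
                    state = State.MdValue;
                    break;
                case State.MdValue:
                    rows.Add((miCode, mdIndex, line));
                    state = State.ExpectTag;
                    break;
            }
        }
        return rows;
    }
}
```

Each row carries the code of the MI block it belongs to, so the MD-to-MI association survives without any regex grouping.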
duckyflip
+3  A: 

Every Match has a Groups collection. A repeated group such as your (MD...)+ appears there only once (it is Groups[2] in your regex, after the (\d) group), and its Value holds just the last repetition, like "MD\n3\n0.0000".

Every Group also has a Captures collection, which you can iterate over to find all MD instances. This will give you one string per MD, so Matches[0].Groups[2].Captures[0] will be "MD\n1\n0.0000".

EDIT: Although you've already accepted the answer, here's a way to parse everything in a single go:

// Requires: using System; using System.Text.RegularExpressions;
// MD1 and MD2 each capture exactly once per MD record, so their
// Captures collections stay aligned index by index.
string pat = @"MI[\r\n]*(?<MI1>\d\d)[\r\n]*(?<MI2>\d+)[\r\n]*" +
    @"(MD[\r\n]*(?<MD1>\d+)[\r\n]*(?<MD2>[\d.\-]+)[\r\n]*)*";

var r = new Regex(pat);
foreach (Match match in r.Matches(text))
{
    Console.WriteLine("MI v1:{0} v2:{1}",
        match.Groups["MI1"], match.Groups["MI2"]);

    // Captures.Count is 0 when the MI block has no MD records.
    for (var i = 0; i < match.Groups["MD1"].Captures.Count; i++)
        Console.WriteLine("  MD v1:{0} v2:{1}",
            match.Groups["MD1"].Captures[i],
            match.Groups["MD2"].Captures[i]);
}

This is the test text I used:

MI
00
3

MD
1
0.1000
MD
2
0.2000
MD
3
0.3000

MI
12
5

MI
24
5

MD
1
0.1000

The output is:

MI v1:00 v2:3
  MD v1:1 v2:0.1000
  MD v1:2 v2:0.2000
  MD v1:3 v2:0.3000
MI v1:12 v2:5
MI v1:24 v2:5
  MD v1:1 v2:0.1000
Andomar
Exactly what I was looking for. Thanks!
Austin Salonen