I have a large text file of records, each delimited by a newline. Each record is prefixed by a two-digit number which specifies its type. Here's an example:

....

30AA ALUMINIUM ALLOY     LMELMEUSD2.00  0.35         5101020100818
40AADFALUMINIUM ALLOY USD USD 100   1       0.20000    1.00   0 100  140003
50201008180.999993  0.00  0.00  120100818
60       0F     1  222329 1.000000      0      0  -4667  -4667   4667   4667
50201008190.999986  0.00  0.00  120100819
60       0F     1  222300 1.000000      0      0  -4667  -4667   4667   4667
40AADOALUMINIUM ALLOY USD USD 100   1       0.20000    1.00   0 100  140001
50201009150.000000  0.17  0.17  120100915
60    1200C     1  101779 0.999800      0      0  -4666  -4666   4665   4665
60    1200P     1       0 0.000000      0      0      0      0      0      0
60    1225C     1   99279 0.999800     -1     -1  -4667  -4667   4665   4665
60    1225P     1       0 0.000000      0      0      0      0      0      0
60    1250C     1   96780 0.999800      0      0  -4666  -4666   4665   4665
60    1250P     1       0 0.000000      0      0      0      0      0      0
60    1275C     1   94280 0.999800     -1     -1  -4667  -4667   4665   4665
60    1275P     1       0 0.000000      0      0      0      0      0      0
60    1300C     1   91781 0.999800      0      0  -4666  -4666   4665   4665
60    1300P     1       0 0.000000

.......

The file contains a hierarchical relationship, based on the two-digit prefixes. You can think of the "30" lines as containing "40" lines as their children, "40" lines as containing "50"s, and "50"s as containing "60"s. After parsing, these lines and their associated prefixes will map to CLR types: "30"s map to "ContractGroup", "40"s to "InstrumentTypeGroup", "50"s to "ExpirationGroup", etc.

I'm attempting to take a functional approach to the parse, as well as reducing memory consumption with a lazy load approach, since this file is extremely large. My first step is in creating a generator to yield one line at a time, something like this:

 public static IEnumerable<string> TextFileLineEnumerator()
 {
     using (StreamReader sr = new StreamReader("BigDataFile.txt"))
     {
         while (!sr.EndOfStream)
         {
             yield return sr.ReadLine();
         }
     }
 }

This allows me to use LINQ against the text file and process the lines as a stream.

My problem is attempting to process this stream into its compositional collection structure. Here's a first attempt:

  var contractgroups =   from strings in TextFileLineEnumerator()
                          .SkipWhile(s => s.Substring(0, 2) != "30")
                            .Skip(1) where strings.Substring(0,2) != "30"
                              select strings;

This gives me all child lines of "30" (but unfortunately omits the "30" line itself). This query will obviously require subqueries to gather and project the lines (via a select) into their appropriate types, with appropriate compositions (ContractGroups containing a List of InstrumentTypeGroups, etc.).

This problem more than likely boils down to my lack of experience with functional programming, so any pointers on this sort of parsing would be helpful. Thanks.

+1  A: 

It's not totally clear to me exactly what you're trying to do, but how I would approach this problem would be to first write a PartitionLines function like this:

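// Extension method: must live inside a static class.
// Needs: using System; using System.Collections.Generic;
public static IEnumerable<IEnumerable<string>> PartitionLines(
    this IEnumerable<string> source,
    Func<string, string> groupMarkerSelector,
    string delimiter)
{
    List<string> currentGroup = new List<string>();

    foreach (string line in source)
    {
        var key = groupMarkerSelector(line);

        // A delimiter line starts a new group, so hand back the finished one first.
        if (delimiter == key && currentGroup.Count > 0)
        {
            yield return currentGroup;
            currentGroup = new List<string>();
        }

        currentGroup.Add(line);
    }

    // Don't lose the last group at the end of the input.
    if (currentGroup.Count > 0)
        yield return currentGroup;
}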

(Note that my function loads one "group" at a time into memory; I assume this is OK.)

I'd then use it something like this:

var line30Groups =
    TextFileLineEnumerator().
    PartitionLines(l => l.Substring(0, 2), "30");

Now you've got the lines in groups, with a new group of lines starting each time you see a "30". You could subdivide further:

var line3040Groups =
    TextFileLineEnumerator().
    PartitionLines(l => l.Substring(0, 2), "30").Select(g =>
        g.PartitionLines(l => l.Substring(0, 2), "40"));

Now you've got the lines in groups under each "30", and each of those groups is itself an enumerable of groups, one per child "40". And so on.
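To get from nested groups of strings to typed objects, something like the following might work. It is only a sketch: the `ContractGroup` and `InstrumentTypeGroup` shapes here are my guesses (raw header line plus children, nothing parsed yet), and `SplitOn`, `RecordProjection`, and `ToContractGroups` are names I made up. `SplitOn` is the same idea as `PartitionLines`, repeated so the snippet compiles on its own. It assumes every "30" line is immediately followed by a "40" line, as in the sample data, and it peels off each header before subdividing so the "30" line does not end up inside the first "40" group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration; the real classes would hold parsed fields.
public class InstrumentTypeGroup
{
    public string Header;            // the raw "40" line
    public List<string> Children;    // the raw "50"/"60" lines beneath it, still unparsed
}

public class ContractGroup
{
    public string Header;            // the raw "30" line
    public List<InstrumentTypeGroup> InstrumentTypes;
}

public static class RecordProjection
{
    // Starts a new group each time a line begins with the given prefix.
    public static IEnumerable<List<string>> SplitOn(
        this IEnumerable<string> source, string prefix)
    {
        var current = new List<string>();
        foreach (var line in source)
        {
            if (line.StartsWith(prefix, StringComparison.Ordinal) && current.Count > 0)
            {
                yield return current;
                current = new List<string>();
            }
            current.Add(line);
        }
        if (current.Count > 0)
            yield return current;
    }

    // Lazily projects the line stream into ContractGroups, one "30" group at a time.
    public static IEnumerable<ContractGroup> ToContractGroups(
        this IEnumerable<string> lines)
    {
        return lines
            // Drop anything before the first "30" so every group starts with its header.
            .SkipWhile(l => !l.StartsWith("30", StringComparison.Ordinal))
            .SplitOn("30")
            .Select(g => new ContractGroup
            {
                Header = g[0],
                // Assumption: the line right after the "30" header is a "40" line.
                InstrumentTypes = g.Skip(1)
                    .SplitOn("40")
                    .Select(t => new InstrumentTypeGroup
                    {
                        Header = t[0],
                        Children = t.Skip(1).ToList()
                    })
                    .ToList()
            });
    }
}
```

You'd call it as `TextFileLineEnumerator().ToContractGroups()`. Only one "30" group is held in memory at a time, as long as you enumerate the result rather than calling `ToList()` on it. The "50"/"60" level would nest the same way inside `Children`.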

This is untested and could be cleaner, but you get the picture, I hope.

mquander
I think you'll want to `yield return currentGroup.ToArray()` or something like that instead of `currentGroup` itself since otherwise the OP could end up calling `PartitionLines(s => s.Substring(0, 2), "30").ToList()` and getting a whole bunch of instances of the same `List<string>` object having a single set of elements.
Dan Tao
Dan Tao, I agree, I just hastily screwed it up. I think the cleanest way is to `currentGroup = new List<string>()` instead of clearing it. I edited my post.
mquander
Excellent solution mquander. I did run into the issue Dan mentioned, regarding the repeated List<string> instances, but ToList'ed the yielded currentGroup as per his recommendation, which fixed the issue. Thank you for your efforts!
Pierreten
Just saw the edit. That works too, and it doesn't force an evaluation immediately, which is more desirable.
Pierreten