I have a large text file of records, each delimited by a newline. Each record is prefixed by a two-digit number that specifies its type. Here's an example:
....
30AA ALUMINIUM ALLOY LMELMEUSD2.00 0.35 5101020100818
40AADFALUMINIUM ALLOY USD USD 100 1 0.20000 1.00 0 100 140003
50201008180.999993 0.00 0.00 120100818
60 0F 1 222329 1.000000 0 0 -4667 -4667 4667 4667
50201008190.999986 0.00 0.00 120100819
60 0F 1 222300 1.000000 0 0 -4667 -4667 4667 4667
40AADOALUMINIUM ALLOY USD USD 100 1 0.20000 1.00 0 100 140001
50201009150.000000 0.17 0.17 120100915
60 1200C 1 101779 0.999800 0 0 -4666 -4666 4665 4665
60 1200P 1 0 0.000000 0 0 0 0 0 0
60 1225C 1 99279 0.999800 -1 -1 -4667 -4667 4665 4665
60 1225P 1 0 0.000000 0 0 0 0 0 0
60 1250C 1 96780 0.999800 0 0 -4666 -4666 4665 4665
60 1250P 1 0 0.000000 0 0 0 0 0 0
60 1275C 1 94280 0.999800 -1 -1 -4667 -4667 4665 4665
60 1275P 1 0 0.000000 0 0 0 0 0 0
60 1300C 1 91781 0.999800 0 0 -4666 -4666 4665 4665
60 1300P 1 0 0.000000
.......
The file contains a hierarchical relationship based on the two-digit prefixes. You can think of the "30" lines as containing "40" lines as their children, "40" lines as containing "50"s, and "50"s as containing "60"s. After parsing, each line will map to a CLR type according to its prefix: "30"s map to "ContractGroup", "40"s to "InstrumentTypeGroup", "50"s to "ExpirationGroup", etc.
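To make the target shape concrete, here is a rough sketch of the composition I have in mind. Only the three class names come from the description above; the "PriceRecord" name for the "60" lines, the "Header" property holding the raw line, and the child collection names are all placeholders, not the real types.

```csharp
using System.Collections.Generic;

// Hypothetical type for "60" lines; the real name is not given above.
public class PriceRecord
{
    public string Header { get; set; }   // raw "60" line (assumed property)
}

// "50" lines: each one owns the "60" lines that follow it.
public class ExpirationGroup
{
    public string Header { get; set; }
    public List<PriceRecord> Records { get; } = new List<PriceRecord>();
}

// "40" lines: each one owns the "50" groups that follow it.
public class InstrumentTypeGroup
{
    public string Header { get; set; }
    public List<ExpirationGroup> Expirations { get; } = new List<ExpirationGroup>();
}

// "30" lines: the root of each hierarchy.
public class ContractGroup
{
    public string Header { get; set; }
    public List<InstrumentTypeGroup> InstrumentTypes { get; } = new List<InstrumentTypeGroup>();
}
```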
I'm attempting to take a functional approach to the parse, and to reduce memory consumption by loading lazily, since the file is extremely large. My first step is to create a generator that yields one line at a time, something like this:
public static IEnumerable<string> TextFileLineEnumerator()
{
    using (StreamReader sr = new StreamReader("BigDataFile.txt"))
    {
        while (!sr.EndOfStream)
        {
            yield return sr.ReadLine();
        }
    }
}
This allows me to run LINQ queries against the text file and process the lines as a stream.
My problem is turning this stream into its compositional collection structure. Here's a first attempt:
var contractgroups = from strings in TextFileLineEnumerator()
                         .SkipWhile(s => s.Substring(0, 2) != "30")
                         .Skip(1)
                     where strings.Substring(0, 2) != "30"
                     select strings;
This gives me the lines that follow the first "30" line, but it omits the "30" line itself. Because the where clause only filters out later "30" lines, the children of every subsequent "30" also end up in the same flat sequence. The query will clearly need subqueries to gather the lines and project them (via a select) into their appropriate types, with the appropriate composition (ContractGroups containing a List of InstrumentTypeGroups, etc.)
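For illustration, the kind of building block I suspect is needed looks something like the sketch below. "SplitBefore" is a hypothetical extension method of my own naming, not part of LINQ. It starts a new chunk at every line for which the predicate is true, so each chunk begins with its header line. It buffers one chunk at a time rather than the whole file, and it silently drops any lines that appear before the first header.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecordGrouping
{
    // Hypothetical helper: yields one List<string> per group, where a group
    // starts at a line matching startsGroup and runs until the next match.
    public static IEnumerable<List<string>> SplitBefore(
        this IEnumerable<string> lines, Func<string, bool> startsGroup)
    {
        List<string> current = null;
        foreach (var line in lines)
        {
            if (startsGroup(line))
            {
                // A new header closes the previous chunk, if there was one.
                if (current != null) yield return current;
                current = new List<string> { line };
            }
            else if (current != null)
            {
                current.Add(line);
            }
            // Lines seen before the first header are ignored.
        }
        if (current != null) yield return current;
    }
}
```

Under these assumptions, calling TextFileLineEnumerator().SplitBefore(l => l.StartsWith("30", StringComparison.Ordinal)) would yield one chunk per "30" line, with the "30" line first. The same helper could then be applied to the remainder of each chunk with "40" and "50" to build the nested levels.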
This problem more than likely boils down to my lack of experience with functional programming, so any pointers on this sort of parsing would be helpful. Thanks.