Hi,

I'm trying to parse a text file that has a heading and the body. In the heading of this file, there are line number references to sections of the body. For example:

SECTION_A 256
SECTION_B 344
SECTION_C 556

This means that SECTION_A starts at line 256.

What would be the best way to parse this heading into a dictionary and then read the sections when necessary?

Typical scenarios would be:

  1. Parse the header and read only section SECTION_B
  2. Parse the header and read the first paragraph of each section.

The data file is quite large, and I definitely don't want to load all of it into memory and then operate on it.

I'd appreciate your suggestions. My environment is VS 2008 and .NET 3.5 SP1 (C# 3.0).

+2  A: 

Well, obviously you can store the name and line number in a dictionary, but on its own that's not going to do you much good.

Sure, it tells you which line to start reading from, but the problem is: where in the file is that line? Without an index, the only way to find out is to start from the beginning and count lines.

The best way would be to write a wrapper that decodes the text contents (if you have encoding issues) and builds a mapping from line number to byte position. You could then take a line number, say 256, look it up in that mapping to learn that line 256 starts at, for example, position 10000 in the file, seek there, and start reading.
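
A minimal sketch of such a line-number-to-byte-position mapping, in case it helps. The class and method names (LineOffsetIndex, Build, OpenAtLine) are made up for illustration, and it assumes 1-based line numbers, a seekable stream, and an encoding such as ASCII or UTF-8 in which a newline is always the single byte 0x0A:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
static class LineOffsetIndex
{
    // Returns a list where offsets[i] is the byte position at which line i + 1 starts.
    // If the file ends with a newline, the last entry points at end of file.
    public static List<long> Build(Stream stream)
    {
        List<long> offsets = new List<long>();
        offsets.Add(0); // line 1 starts at byte 0
        stream.Seek(0, SeekOrigin.Begin);

        byte[] buffer = new byte[64 * 1024];
        long position = 0; // file offset of buffer[0]
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // Assumption: '\n' is always the single byte 0x0A (ASCII, UTF-8).
                if (buffer[i] == (byte)'\n')
                    offsets.Add(position + i + 1);
            }
            position += read;
        }
        return offsets;
    }

    // Seeks to the start of a 1-based line and returns a reader positioned there.
    // Disposing the returned reader also closes the underlying stream.
    public static StreamReader OpenAtLine(Stream stream, List<long> offsets, int lineNumber)
    {
        if (lineNumber < 1 || lineNumber > offsets.Count)
            throw new ArgumentOutOfRangeException("lineNumber");
        stream.Seek(offsets[lineNumber - 1], SeekOrigin.Begin);
        return new StreamReader(stream, Encoding.UTF8);
    }
}
```

Scanning raw bytes rather than decoded text keeps the recorded positions exact, which a buffered StreamReader cannot give you directly. The index costs one full pass over the file, so it only pays off if you read more than one section.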

Is this a one-off processing situation? If not, have you considered stuffing the entire file into a local database, like a SQLite database? That would allow you to have a direct mapping between line number and its contents. Of course, that file would be even bigger than your original file, and you'd need to copy data from the text file to the database, so there's some overhead either way.

Lasse V. Karlsen
Thanks, I was afraid that I would have to preprocess the file before I start operating on it. An external database is not an option for me, since the data file changes too often for that. Anyway, thanks for your answer.
Michal Rogozinski
You can be lazy about it and only preprocess from the beginning up to the part that you're interested in. Every position from 0 up to that point can then be reached with .Seek(), and if you're interested in later parts you can continue preprocessing from where you left off. Also store the index with a timestamp and any identifying information for later use (or simply delete it when the data is updated).
Pasi Savolainen
I like Pasi's suggestion, thanks!
Michal Rogozinski
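
For what it's worth, the lazy variant suggested in the comments above could look roughly like this. It is only a sketch: LazyLineIndex and its members are hypothetical names, line numbers are assumed to be 1-based, and a newline is assumed to be the single byte 0x0A (ASCII or UTF-8).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: indexes line starts only as far as callers have asked for.
class LazyLineIndex
{
    private readonly Stream stream;
    private readonly List<long> offsets = new List<long>(); // offsets[i] = start of line i + 1
    private long scanned;    // number of bytes examined so far
    private bool reachedEnd; // true once the whole file has been scanned

    public LazyLineIndex(Stream seekableStream)
    {
        stream = seekableStream;
        offsets.Add(0); // line 1 starts at byte 0
    }

    // Number of line starts discovered so far.
    public int IndexedLines { get { return offsets.Count; } }

    // Returns the byte offset of a 1-based line, or -1 if the file has fewer lines.
    public long GetOffset(int lineNumber)
    {
        if (lineNumber < 1) throw new ArgumentOutOfRangeException("lineNumber");
        byte[] buffer = new byte[4096];
        while (offsets.Count < lineNumber && !reachedEnd)
        {
            stream.Seek(scanned, SeekOrigin.Begin); // resume where the last scan stopped
            int read = stream.Read(buffer, 0, buffer.Length);
            if (read == 0) { reachedEnd = true; break; }
            for (int i = 0; i < read; i++)
            {
                // Assumption: '\n' is always the single byte 0x0A.
                if (buffer[i] == (byte)'\n')
                    offsets.Add(scanned + i + 1);
            }
            scanned += read;
        }
        return lineNumber <= offsets.Count ? offsets[lineNumber - 1] : -1;
    }
}
```

Asking for SECTION_A at line 256 scans only the first few kilobytes; a later request for line 556 continues from there instead of starting over. To cover the timestamp part of the suggestion, you could remember File.GetLastWriteTimeUtc for the data file next to the index and throw the index away when that value changes.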
A: 

Just read the file one line at a time and ignore the lines until you get to the ones you need. You won't have any memory issues, but performance probably won't be great, since every read starts from the top of the file. You can do this easily in a background thread though.

mdm20
Probably a good idea to sort the required sections defined in the header by line number, then read them in that order, so that only one pass through is needed.
Winston Smith
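
A rough sketch of that single-pass idea, applied to the "first paragraph of each section" scenario from the question. SectionReader and ReadFirstParagraphs are made-up names, and the code assumes 1-based line numbers, distinct section start lines, and that a paragraph ends at the first blank line:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
static class SectionReader
{
    // Reads the first paragraph of every section in one forward pass over the reader.
    public static Dictionary<string, List<string>> ReadFirstParagraphs(
        TextReader reader, Dictionary<string, int> sectionIndex)
    {
        // Sort sections by starting line so the file is only walked once.
        List<KeyValuePair<string, int>> ordered =
            new List<KeyValuePair<string, int>>(sectionIndex);
        ordered.Sort(delegate(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
        {
            return a.Value.CompareTo(b.Value);
        });

        Dictionary<string, List<string>> result = new Dictionary<string, List<string>>();
        int next = 0;                // index of the next section to start
        List<string> current = null; // paragraph being collected, if any
        int lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (next < ordered.Count && lineNumber == ordered[next].Value)
            {
                current = new List<string>();
                result[ordered[next].Key] = current;
                next++;
            }
            if (current != null)
            {
                if (line.Trim().Length == 0)
                    current = null; // assumption: a blank line ends the paragraph
                else
                    current.Add(line);
            }
            if (current == null && next == ordered.Count)
                break; // nothing left to collect, stop reading early
        }
        return result;
    }
}
```

Because the loop stops as soon as the last requested paragraph is complete, the tail of a large file is never read.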
A: 

Read the file until the end of the header, assuming you know where that is. Split the strings you've stored on whitespace, like so:

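Dictionary<string, int> sectionIndex = new Dictionary<string, int>();
List<string> headers = new List<string>(); // fill this with the header lines from ReadLine

foreach (string header in headers) {
    // Split on spaces, ignoring repeated separators: "SECTION_A 256" -> { "SECTION_A", "256" }
    var s = header.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    // Map the section name to its starting line number
    sectionIndex.Add(s[0], Int32.Parse(s[1]));
}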

Find the dictionary entry you want, keep a count of the lines read from the file, and loop until you hit that line number, then read until you reach the next section's starting line. Dictionary<TKey, TValue> does not guarantee any enumeration order, so you can't just take "the next key"; you need to work out which section follows the current one, for example by comparing the stored line numbers.

Be sure to do some error checking to make sure the section you're reading to isn't before the section you're reading from, and any other error cases you can think of.
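
To illustrate the steps above, here is a small sketch. SectionRange, FindEndLine and ReadSection are hypothetical names, and line numbers are assumed to be 1-based and counted from the top of the file. The end of a section is found by looking for the smallest start line greater than the current one, so dictionary order never matters:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
static class SectionRange
{
    // Returns the first line of the section that follows 'name',
    // or int.MaxValue if 'name' is the last section in the file.
    public static int FindEndLine(Dictionary<string, int> sectionIndex, string name)
    {
        int start = sectionIndex[name]; // throws KeyNotFoundException for unknown names
        int end = int.MaxValue;
        foreach (KeyValuePair<string, int> entry in sectionIndex)
        {
            if (entry.Value > start && entry.Value < end)
                end = entry.Value;
        }
        return end;
    }

    // Reads lines [start, end) of a section by counting lines from the top of the reader.
    public static List<string> ReadSection(
        TextReader reader, Dictionary<string, int> sectionIndex, string name)
    {
        int start = sectionIndex[name];
        if (start < 1)
            throw new InvalidDataException("Section start line must be 1 or greater.");
        int end = FindEndLine(sectionIndex, name);

        List<string> lines = new List<string>();
        int lineNumber = 0;
        string line;
        // Stop before the next section's first line, or at end of file.
        while (lineNumber < end - 1 && (line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber >= start)
                lines.Add(line);
        }
        return lines;
    }
}
```

Picking the end line this way also covers the "reading to a section that is before the one you're reading from" case, since only later start lines are considered.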

Chris Doggett
A: 

You could read line by line until all the heading information is captured and stop (assuming all section pointers are in the heading). You would have the section and line numbers for use in retrieving the data at a later time.

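// Requires: using System; using System.IO; using System.Windows.Forms;
string dataRow;

try
{
    // "using" closes the reader even if an exception is thrown
    using (TextReader tr = new StreamReader("filename.txt"))
    {
        // ReadLine returns null at the end of the file
        while ((dataRow = tr.ReadLine()) != null)
        {
            // The header ends at the first line that is not a section entry
            if (!dataRow.StartsWith("SECTION_", StringComparison.Ordinal))
                break;

            //Parse line for section code and line number and log values
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}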
Rich.Carpenter
+1  A: 
Jason Williams