Hi.

My simple requirement: reading a huge (more than a million lines) text file (for this example, assume it's a CSV of some sort) and keeping a reference to the beginning of each line for faster lookups later (read a line starting at position X).

I tried the naive and easy way first, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately that doesn't work as I intended:

Given a file containing the following

Foo
Bar
Baz
Bla
Fasel

and this very simple code

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = sr.BaseStream.Position;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos = sr.BaseStream.Position;
  }
}

the output is:

000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel

I can imagine that the reader is trying to be helpful/efficient and reads ahead in (big) chunks whenever new data is needed, so BaseStream.Position points at the end of the buffered block rather than at the line just returned. For me this is bad.

The question, finally: is there any way to get the (byte or char) offset while reading a file line by line, without dropping down to a raw Stream and dealing with \r, \n, \r\n and string encoding manually? Not a big deal, really; I just don't like to build things that might already exist.
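For reference, the manual route I'd like to avoid would look roughly like the sketch below. The class name, the ASCII-compatible-encoding assumption and the missing BOM handling are just illustrative, not something I've verified:

using System.Collections.Generic;
using System.IO;
using System.Text;

static class ByteOffsetLineReader
{
    // Yields (byte offset of line start, line text) pairs, splitting on
    // \n, \r or \r\n. Assumes an ASCII-compatible encoding such as UTF-8,
    // where the newline bytes never occur inside a multi-byte sequence;
    // a BOM, if present, is not skipped here.
    public static IEnumerable<KeyValuePair<long, string>> ReadLines(
        string path, Encoding encoding)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var lineBytes = new List<byte>();
            long lineStart = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                if (b == '\n' || b == '\r')
                {
                    if (b == '\r')
                    {
                        // Swallow the '\n' of a "\r\n" pair, if there is one.
                        int next = fs.ReadByte();
                        if (next != '\n' && next != -1)
                            fs.Position -= 1; // not part of the pair, put it back
                    }
                    yield return new KeyValuePair<long, string>(
                        lineStart, encoding.GetString(lineBytes.ToArray()));
                    lineBytes.Clear();
                    lineStart = fs.Position;
                }
                else
                {
                    lineBytes.Add((byte)b);
                }
            }
            if (lineBytes.Count > 0) // last line without a trailing newline
                yield return new KeyValuePair<long, string>(
                    lineStart, encoding.GetString(lineBytes.ToArray()));
        }
    }
}

Run against the sample file above, each Key would be the byte offset that BaseStream.Position fails to give me. It's exactly the kind of boilerplate I was hoping to skip.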

A: 

Would this work:

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = 0;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos += line.Length;
  }
}
Sani Huttunen
Unfortunately not, because I have to accept different types of newlines (\n, \r\n, \r), so the number would be skewed. This might work if I insisted on a _consistent_ newline separator (it could very well be mixed in practice) and probed for it first to know the real per-line overhead. So I'm trying to avoid going down that route.
Benjamin Podszun
@Benjamin: Darn - I just posted a similar answer which explicitly relied on a consistent newline separator...
Jon Skeet
Then I think you'd be better off doing it manually with StreamReader.Read().
Sani Huttunen
@Jon: Hehe. As I said, that _might_ be the way, instead of using a plain Stream. If these are the only two options, I'll have to roll a die and live with the consequences: either assume consistent separators (bad for files that were processed on more than one platform, copy/pasted in bad editors, etc.) or go low-level with the Stream (boring line parsing and string-encoding mess, a lot of boilerplate for a seemingly low return). (A sketch of what the consistent-separator variant would look like follows this thread.)
Benjamin Podszun
@Sani: That wouldn't help much. I'd have to ditch the whole `StreamReader`. Even `Read()` on it leads to a block read on the underlying stream and moves `BaseStream.Position` to 25 for my sample, after _one char_.
Benjamin Podszun
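For completeness, the consistent-separator variant discussed above would look roughly like the sketch below. The fixed separator length and the one-character-equals-one-byte assumption are exactly the assumptions that can't be made here:

// Sketch only: assumes every line ends with "\r\n" and a single-byte encoding,
// so that line.Length + 2 really is the number of bytes consumed per line.
const int separatorLength = 2;

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = 0;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos += line.Length + separatorLength;
  }
}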
+3  A: 

You could create a TextReader wrapper, which would track the current position in the base TextReader:

public class TrackingTextReader : TextReader
{
    private TextReader _baseReader;
    private int _position;

    public TrackingTextReader(TextReader baseReader)
    {
        _baseReader = baseReader;
    }

    // TextReader.ReadLine() falls back to Read(), so counting here
    // keeps Position accurate for line-by-line reading as well.
    public override int Read()
    {
        int c = _baseReader.Read();
        if (c != -1)
            _position++; // don't count the end-of-stream marker
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    public int Position
    {
        get { return _position; }
    }
}

You could then use it as follows:

string text = @"Foo
Bar
Baz
Bla
Fasel";

using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
    string line;
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", trackingReader.Position, line);
    }
}
Thomas Levesque
Seems to work. That somehow seems so obvious now... Thanks a lot.
Benjamin Podszun
This solution is fine as long as you want the character position rather than the byte position. If the underlying file has a Byte Order Mark (BOM), the position will be offset, and if it uses multi-byte characters, the 1:1 correspondence between characters and bytes no longer holds.
FrederikB
Agreed, this only works for single-byte encodings, e.g. ASCII. If, for instance, your underlying file is Unicode, each character is encoded as 2 or 4 bytes. The implementation above works on a character stream, not a byte stream, so you get character offsets that will not map onto the actual byte positions. For example, the second character will be reported at index 1, but its byte position will actually be 2 or 4. A BOM (Byte Order Mark), if present, adds further bytes to the true underlying byte position.
chibacity
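To get byte positions out of the same wrapper idea, one option is sketched below, under the assumption that the file's encoding is known: feed every character read through a stateful Encoder and add up the bytes it produces. The class name and this counting approach are illustrative, not part of the answer above, and a BOM still has to be accounted for separately:

// A variant of the wrapper above that tracks a byte position instead of a
// character position, assuming the file's encoding is known.
public class ByteTrackingTextReader : TextReader
{
    private readonly TextReader _baseReader;
    private readonly Encoder _encoder;
    private readonly char[] _oneChar = new char[1];
    private readonly byte[] _scratch;
    private long _bytePosition;

    public ByteTrackingTextReader(TextReader baseReader, Encoding encoding)
    {
        _baseReader = baseReader;
        _encoder = encoding.GetEncoder();
        _scratch = new byte[encoding.GetMaxByteCount(2)];
        // If the file starts with a BOM, initialise _bytePosition with
        // encoding.GetPreamble().Length yourself, since StreamReader
        // swallows the BOM before any character reaches this wrapper.
    }

    public override int Read()
    {
        int c = _baseReader.Read();
        if (c != -1)
        {
            _oneChar[0] = (char)c;
            // flush: false lets the encoder hold on to a high surrogate until
            // its partner arrives, so surrogate pairs are counted correctly.
            _bytePosition += _encoder.GetBytes(_oneChar, 0, 1, _scratch, 0, false);
        }
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    public long BytePosition
    {
        get { return _bytePosition; }
    }
}

Wrapped around a StreamReader created with the same encoding, BytePosition sampled before each ReadLine call should correspond to the byte offset of that line in the file (plus the BOM length, if any), which is what you would later pass to Stream.Seek.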