I scanned a 2.8 GB XML file for the positions (indexes) of particular tags. Then I use the Seek method to set a start point in that file. The file is UTF-8 encoded. The indexing looks like this:


using(StreamReader sr = new StreamReader(pathToFile)){
  long index = 0;
  while(!sr.EndOfStream){
    string line = sr.ReadLine();
    index += (line.Length + 2); // remember the \r\n chars

    if(LineHasTag(line)){
      SaveIndex(index-line.Length); //need beginning of the line
    }
  }
}

So afterwards I have the indexed positions in another file. But when I use Seek, the result is off: the position ends up somewhere before where it should be. I loaded some of the file's content into a char array and manually checked the correct index of a tag I need. It is the same as the one indexed by the code above. Yet the Seek method on StreamReader.BaseStream still places the pointer earlier in the file. Quite strange.

Any suggestions?

Best regards, ventus

+2  A: 

Seek deals in bytes - you're assuming there's one byte per character. In UTF-8, one character in the BMP can take up to three bytes.

My guess is that you've got non-ASCII characters in your file - those will take more than one byte.
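
To make the character/byte mismatch concrete, here is a minimal sketch that compares `string.Length` with the UTF-8 byte count. The class and method names (`ByteCountDemo`, `Utf8Bytes`, `BomLength`) are made up for illustration; only `Encoding.UTF8.GetByteCount` and `Encoding.UTF8.GetPreamble` are real framework calls.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class ByteCountDemo
{
    // Number of bytes the text occupies once encoded as UTF-8.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Length in bytes of the UTF-8 byte order mark (EF BB BF).
    public static int BomLength()
    {
        return Encoding.UTF8.GetPreamble().Length;
    }
}
```

For a pure ASCII string such as `"tag"`, `Length` and `Utf8Bytes` are both 3. For `"\u017C\u00F3\u0142w"` (the Polish word for turtle), `Length` is 4 but `Utf8Bytes` returns 7, so an index built from `Length` falls 3 bytes short after that one word alone.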

I think there may also be a potential problem with the byte order mark, if there is one. I can't remember offhand whether StreamReader will swallow that automatically - which would put you 3 bytes out to start with.

Jon Skeet
Well, that's true, there are non-ASCII characters in that file. Then how to index it correctly?
Ventus
@Ventus: You could call `Encoding.GetByteCount(line)` and hope that it all keeps in sync. I can't immediately think of situations where it wouldn't.
Jon Skeet
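
A rough sketch of what byte-based indexing along those lines might look like. The names `LineOffsetIndexer`, `IndexLines`, `HasUtf8Bom` and the `isMatch` callback (standing in for `LineHasTag`) are hypothetical. It assumes the file is valid UTF-8 and that every line ends with `\r\n`; lone `\n` line endings or invalid byte sequences would push the offsets out of sync.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
static class LineOffsetIndexer
{
    // Returns the byte offset of the start of every line accepted by isMatch.
    // Assumption: UTF-8 content with "\r\n" after every line.
    public static List<long> IndexLines(string path, Func<string, bool> isMatch)
    {
        var offsets = new List<long>();

        // StreamReader skips a UTF-8 byte order mark, so account for it up front.
        long position = HasUtf8Bom(path) ? 3 : 0;

        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Record the offset before advancing: this is the start of the line.
                if (isMatch(line))
                {
                    offsets.Add(position);
                }

                // Advance by encoded bytes, not characters, plus 2 bytes for "\r\n".
                position += Encoding.UTF8.GetByteCount(line) + 2;
            }
        }

        return offsets;
    }

    // True if the file starts with the UTF-8 byte order mark EF BB BF.
    static bool HasUtf8Bom(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var head = new byte[3];
            int read = stream.Read(head, 0, 3);
            return read == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
        }
    }
}
```

The returned offsets are meant to be passed to `Seek` on a plain `FileStream`. Seeking on `StreamReader.BaseStream` while the reader still holds buffered data can give confusing results unless `DiscardBufferedData()` is called afterwards.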
@Jon: It seems to be better now. I still have to test it. I can see that sometimes it's still not perfect, but I hope it's going to be OK.
Ventus