This is just a personal project I've been digging into. Basically, I parse a text file (anywhere from 20 MB up to about 1 GB) using StreamReader. The performance is pretty solid, but still... I've been itching to see what would happen if I parsed it in binary. Don't misunderstand, I'm not prematurely optimizing; I'm definitely micro-optimizing on purpose, just "to see".
So, I'm reading the file in as byte arrays. Come to find out, newlines can be the (Windows) standard CR/LF, a lone CR, or a lone LF... pretty messy. I had hoped to use Array.IndexOf to find each CR and then skip past the LF. Instead I find myself writing code very similar to IndexOf, but checking for either byte and returning an array as needed.
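To make the terminator handling concrete, here's the logic in isolation (a minimal sketch; FindLineBreak is a name I made up for illustration):

// Minimal sketch of the terminator handling: returns the index of the next
// line break in buffer[start..length), or -1, and reports how many bytes the
// break occupies so a CR/LF pair counts as one terminator. (A CR sitting at
// the very end of the buffer is ambiguous until the next read; that's one of
// the overflow headaches.)
static int FindLineBreak(byte[] buffer, int start, int length, out int breakLength) {
    for(int i = start; i < length; i++) {
        if(buffer[i] == 13) { // CR, possibly the start of a CR/LF pair
            breakLength = (i + 1 < length && buffer[i + 1] == 10) ? 2 : 1;
            return i;
        }
        if(buffer[i] == 10) { // lone LF
            breakLength = 1;
            return i;
        }
    }
    breakLength = 0;
    return -1;
}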
So the crux: using code very similar to IndexOf, mine still ends up dramatically slower. To put it in perspective, using an 800 MB file (rough measurement sketched after the list):
- Using Array.IndexOf and looking for CR: ~320 MB/s
- Using StreamReader and ReadLine: ~180 MB/s
- A for loop replicating IndexOf: ~150 MB/s
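For context, the throughput numbers above come from timing a full pass over the file, along these lines (a rough sketch; BinaryLineReader is a hypothetical wrapper around the enumerator, sketched after the first code block):

using System;
using System.Diagnostics;
using System.IO;

class Throughput {
    static void Main() {
        string path = "big.txt"; // placeholder: the 800 MB test file
        var info = new FileInfo(path);
        var sw = Stopwatch.StartNew();
        long lines = 0;
        // enumerate every line and divide file size by elapsed time
        foreach(byte[] line in new BinaryLineReader(path)) // hypothetical wrapper
            lines++;
        sw.Stop();
        Console.WriteLine("{0:F0} MB/s over {1} lines",
            info.Length / (1024.0 * 1024.0) / sw.Elapsed.TotalSeconds, lines);
    }
}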
Here's the code with the for loop (~150 MB/s):
IEnumerator<byte[]> IEnumerable<byte[]>.GetEnumerator() {
    using(FileStream fs = new FileStream(_path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, _bufferSize)) {
        byte[] buffer = new byte[_bufferSize];
        int bytesRead;
        int overflowCount = 0; // bytes of a partial line carried over from the previous read
        while((bytesRead = fs.Read(buffer, overflowCount, buffer.Length - overflowCount)) > 0) {
            int bufferLength = bytesRead + overflowCount;
            int lastPos = 0;
            for(int i = 0; i < bufferLength; i++) {
                // CR or LF ends a line; the LF of a CR/LF pair produces a
                // zero-length segment, which the length check swallows
                if(buffer[i] == 13 || buffer[i] == 10) {
                    int length = i - lastPos;
                    if(length > 0) {
                        byte[] line = new byte[length];
                        Array.Copy(buffer, lastPos, line, 0, length);
                        yield return line;
                    }
                    lastPos = i + 1;
                }
            }
            // shift the trailing partial line to the front of the buffer
            // (lastPos == 0, i.e. a line longer than the buffer, is one of
            // the blow-up cases noted below)
            if(lastPos > 0) {
                overflowCount = bufferLength - lastPos;
                Array.Copy(buffer, lastPos, buffer, 0, overflowCount);
            }
        }
    }
}
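For completeness, that enumerator lives in a small wrapper type along these lines (a sketch with made-up names, matching the BinaryLineReader used in the timing snippet above; the iterator body is the block just shown, stubbed out here so the sketch stands alone):

using System.Collections;
using System.Collections.Generic;

// Hypothetical enclosing type: holds the path and buffer size that the
// enumerator above reads from.
class BinaryLineReader : IEnumerable<byte[]> {
    readonly string _path;
    readonly int _bufferSize;

    public BinaryLineReader(string path, int bufferSize = 128 * 1024) {
        _path = path;
        _bufferSize = bufferSize;
    }

    IEnumerator<byte[]> IEnumerable<byte[]>.GetEnumerator() {
        // ... the iterator body shown above goes here ...
        yield break; // stub for the sketch
    }

    IEnumerator IEnumerable.GetEnumerator() {
        return ((IEnumerable<byte[]>)this).GetEnumerator();
    }
}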
This is the faster code block (~320 MB/s):
while((bytesRead = fs.Read(buffer, overflowCount, buffer.Length - overflowCount)) > 0) {
    int bufferLength = bytesRead + overflowCount;
    int pos = 0;
    int lastPos = 0;
    // let mscorlib do the scanning: find each CR, then peek ahead for the LF
    while(pos < bufferLength && (pos = Array.IndexOf<byte>(buffer, 13, pos)) != -1) {
        int length = pos - lastPos;
        if(length > 0) {
            byte[] line = new byte[length];
            Array.Copy(buffer, lastPos, line, 0, length);
            yield return line;
        }
        // swallow the LF of a CR/LF pair
        if(pos < bufferLength - 1 && buffer[pos + 1] == 10)
            pos++;
        lastPos = ++pos;
    }
    // shift the trailing partial line to the front of the buffer
    if(lastPos > 0) {
        overflowCount = bufferLength - lastPos;
        Array.Copy(buffer, lastPos, buffer, 0, overflowCount);
    }
}
(No, it ain't production ready; certain cases will make it blow up. I use a 128 KB buffer to sidestep most of those.)
So my big question is... why is Array.IndexOf so much faster? It is essentially the same thing: a for loop walking an array. Is there something about the way mscorlib code is executed? Even changing the above code to really replicate IndexOf (looking for just CR and then skipping the LF, exactly as I do when using IndexOf) doesn't help. Err... I've been going through various permutations, and it's late enough that perhaps there's some glaring bug I'm missing?
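To be explicit about that permutation, "really replicate IndexOf" means rewriting the inner loop of the first block to scan only for CR, like this (a sketch of the variant I tried, which was no faster):

// Inner loop rewritten to scan only for CR, the way the IndexOf version
// does, then peek ahead to swallow the LF of a CR/LF pair.
for(int i = 0; i < bufferLength; i++) {
    if(buffer[i] != 13)
        continue;
    int length = i - lastPos;
    if(length > 0) {
        byte[] line = new byte[length];
        Array.Copy(buffer, lastPos, line, 0, length);
        yield return line;
    }
    if(i < bufferLength - 1 && buffer[i + 1] == 10)
        i++; // skip the LF
    lastPos = i + 1;
}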
BTW, I looked into ReadLine and noticed it uses a switch block rather than an if block... when I do something similar, weirdly enough, it does increase throughput by about 15 MB/s. That's another question for another time (why is switch faster than if?), but I figured I'd point out that I did look at it.
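For reference, "something similar" means rewriting the terminator test as a switch, along these lines (a sketch; it's semantically identical to the if version):

// Terminator test as a switch, mirroring ReadLine's shape; same semantics
// as if(buffer[i] == 13 || buffer[i] == 10)
switch(buffer[i]) {
    case 13: // CR
    case 10: // LF
        int length = i - lastPos;
        if(length > 0) {
            byte[] line = new byte[length];
            Array.Copy(buffer, lastPos, line, 0, length);
            yield return line;
        }
        lastPos = i + 1;
        break;
}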
Also, I am testing a release build outside of VS, so there is no debuggery going on.