views: 271
answers: 2
This is just a personal project I've been digging into. Basically, I parse a text file (anywhere from 20mb up to about 1gb) using StreamReader. The performance is pretty solid, but still... I've been itching to see what would happen if I parse it in binary. Don't misunderstand, I'm not prematurely optimizing; I am definitely micro-optimizing on purpose, just "to see".
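
For context, the StreamReader baseline is nothing fancy, roughly along these lines (a sketch; the actual per-line parsing isn't the interesting part):

using(StreamReader reader = new StreamReader(_path)) {
    string line;
    while((line = reader.ReadLine()) != null) {
        // per-line parsing happens here
    }
}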

So, I'm reading in the text file using byte arrays. Come to find out, new lines can be the (Windows) standard CR/LF, or just CR, or just LF... pretty messy. I had hoped to be able to use Array.IndexOf on CR and then skip past the LF. Instead I find myself writing code very similar to IndexOf, but checking for either byte and returning each line as a byte array.

So the crux: using code very similar to IndexOf, mine still ends up being dramatically slower. To put it in perspective, using an 800mb file:

  • Using IndexOf and looking for CR: ~320mb/s
  • Using StreamReader and ReadLine: ~180mb/s
  • for loop replicating IndexOf: ~150mb/s

Here's the code with the for loop (~150mb/s):

IEnumerator<byte[]> IEnumerable<byte[]>.GetEnumerator() {
    using(FileStream fs = new FileStream(_path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, _bufferSize)) {
        byte[] buffer = new byte[_bufferSize];
        int bytesRead;
        int overflowCount = 0; // bytes of a partial line carried over from the previous read
        while((bytesRead = fs.Read(buffer, overflowCount, buffer.Length - overflowCount)) > 0) {
            int bufferLength = bytesRead + overflowCount;
            int lastPos = 0;
            for(int i = 0; i < bufferLength; i++) {
                // treat either CR (13) or LF (10) as a line break
                if(buffer[i] == 13 || buffer[i] == 10) {
                    int length = i - lastPos;
                    if(length > 0) {
                        byte[] line = new byte[length];
                        Array.Copy(buffer, lastPos, line, 0, length);
                        yield return line;
                    }
                    lastPos = i + 1;
                }
            }
            // carry the trailing partial line to the front of the buffer
            if(lastPos > 0) {
                overflowCount = bufferLength - lastPos;
                Array.Copy(buffer, lastPos, buffer, 0, overflowCount);
            }
        }
    }
}

This is the faster code block (~320mb/s):

while((bytesRead = fs.Read(buffer, overflowCount, buffer.Length - overflowCount)) > 0) {
    int bufferLength = bytesRead + overflowCount;
    int pos = 0;
    int lastPos = 0;
    // let Array.IndexOf find the next CR (13)
    while(pos < bufferLength && (pos = Array.IndexOf<byte>(buffer, 13, pos)) != -1) {
        int length = pos - lastPos;
        if(length > 0) {
            byte[] line = new byte[length];
            Array.Copy(buffer, lastPos, line, 0, length);
            yield return line;
        }
        // skip the LF of a CR/LF pair
        if(pos < bufferLength - 1 && buffer[pos + 1] == 10)
            pos++;
        lastPos = ++pos;
    }
    // carry the trailing partial line to the front of the buffer
    if(lastPos > 0) {
        overflowCount = bufferLength - lastPos;
        Array.Copy(buffer, lastPos, buffer, 0, overflowCount);
    }
}

(No, it ain't production ready, certain cases will make it blow up; I use a 128kb size buffer to ignore most of those.)

So my big question is... why is Array.IndexOf so much faster? It's essentially the same thing, a for loop walking an array. Is there something about the way mscorlib code is executed? Even changing the above code to really replicate IndexOf, looking only for CR and then skipping the LF the way I would when using IndexOf, doesn't help. Errr... I've been going through various permutations, and it's late enough that perhaps there's some glaring bug I'm missing?

BTW, I looked into ReadLine and noticed it uses a switch block rather than an if block... when I do something similar, weirdly enough it increases the performance by about 15mb/s. That's another question for another time (why is switch faster than if?), but I figured I'd point out that I did look at it.
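
For reference, the switch variant replaces the inner if in the for-loop block above with roughly this (a sketch, not the exact code I benchmarked):

for(int i = 0; i < bufferLength; i++) {
    switch(buffer[i]) {
        case 13: // CR
        case 10: { // LF
            int length = i - lastPos;
            if(length > 0) {
                byte[] line = new byte[length];
                Array.Copy(buffer, lastPos, line, 0, length);
                yield return line;
            }
            lastPos = i + 1;
            break;
        }
    }
}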

Also, I am testing a release build outside of VS so there is no debuggery going on.

+1  A: 

The mscorlib assemblies are ngen'd during installation. Try ngen'ing your assembly using the Ngen.exe utility (shipped with the .NET Framework, I believe)... and then check the benchmarks again. It could be slightly faster.

To make your .NET code run at near-native speeds, Microsoft recommends that you "Ngen" your code during your app installation...
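
For reference, installing a native image is typically a one-liner from a Visual Studio / framework command prompt (the assembly name below is just a placeholder):

ngen install MyParser.exe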

Mugunth Kumar
ngen doesn't really do anything the just-in-time compiler doesn't do; the difference is that it does the job up front. (So you'd expect the same results when running the code a second time, because by then it's already been compiled.)
Arjan Einbu
+2  A: 

That's a good question. The short version is that it all boils down to the implementation of the IEqualityComparer that IndexOf will use. Let's look at the following piece of code:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Program {

    static int [] buffer = new int [1024];
    const byte mark = 42;
    const int iterations = 10000;

    static void Main ()
    {
     buffer [buffer.Length -1] = mark; // put the mark at the very end so every search scans the whole array

     Console.WriteLine (EqualityComparer<int>.Default.GetType ());

     Console.WriteLine ("Custom:  {0}", Time (CustomIndexOf));
     Console.WriteLine ("Builtin: {0}", Time (ArrayIndexOf));
    }

    static TimeSpan Time (Action action)
    {
     var watch = new Stopwatch ();
     watch.Start ();
     for (int i = 0; i < iterations; i++)
      action ();
     watch.Stop ();
     return watch.Elapsed;
    }

    static void CustomIndexOf ()
    {
     for (int i = 0; i < buffer.Length; i++)
      if (buffer [i] == mark)
       break;
    }

    static void ArrayIndexOf ()
    {
     Array.IndexOf (buffer, mark);
    }
}

You'll need to compile it with csc /optimize+.

Here's the result I have:

C:\Tmp>test
System.Collections.Generic.GenericEqualityComparer`1[System.Int32]
Custom:  00:00:00.0386403
Builtin: 00:00:00.0427903

Now, change the type of the array and of the EqualityComparer to byte, and here's the result I have:

C:\Tmp>test
System.Collections.Generic.ByteEqualityComparer
Custom:  00:00:00.0387158
Builtin: 00:00:00.0165881
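
To be explicit, the byte variant is the same program with only the declarations changed, roughly:

static byte [] buffer = new byte [1024]; // was int []

// and in Main:
Console.WriteLine (EqualityComparer<byte>.Default.GetType ()); // was EqualityComparer<int>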

As you can see, the byte array is special-cased, presumably with a routine optimized to find a byte in a byte array. As I can't decompile the .NET framework, I stopped the analysis there, but I guess that's a pretty good clue.
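
If I had to guess at what that special case amounts to, it would be something like a pointer-walking scan, in the spirit of the following sketch (compiled with /unsafe; this is an assumption, not the actual framework code):

static unsafe int IndexOfByte (byte [] array, byte value, int startIndex, int count)
{
    // walk the array with raw pointers instead of indexed accesses
    fixed (byte* start = array) {
        byte* p = start + startIndex;
        byte* end = p + count;
        while (p < end) {
            if (*p == value)
                return (int) (p - start);
            p++;
        }
    }
    return -1;
}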

Jb Evain
Bingo! It ended up using ByteEqualityComparer, which in turn used Buffer with pointers. Doing something similar, I now hit ~350mb/s. Must say, I'm rather disappointed in the managed performance. Not sure if I really want to drop into unsafe for this little project.
Sean Newton