ansaurus

Question

Answer 1

+4 A:

One caveat: you might want to double-check your CPU's endianness. Assuming little-endian is not quite safe (think Itanium, etc.).
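A minimal sketch of such a check, assuming the file stores its integers in little-endian order. `EndianGuard` and `ReadInt32LittleEndian` are hypothetical names used only for illustration; `BitConverter.IsLittleEndian` is the standard .NET field that reports the host's byte order.

```csharp
using System;

static class EndianGuard
{
    // Decodes a little-endian Int32 from buffer, whatever the host CPU's byte order is.
    public static int ReadInt32LittleEndian(byte[] buffer, int offset)
    {
        if (BitConverter.IsLittleEndian)
        {
            // Fast path: host byte order matches the (assumed) file format.
            return BitConverter.ToInt32(buffer, offset);
        }

        // Portable path: assemble the value byte by byte.
        return buffer[offset]
             | (buffer[offset + 1] << 8)
             | (buffer[offset + 2] << 16)
             | (buffer[offset + 3] << 24);
    }
}
```

If deployment really is limited to little-endian machines, a one-time startup check that throws when `BitConverter.IsLittleEndian` is false would probably be enough.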

You might also want to see if BufferedStream makes any difference (I'm not sure it will).

Marc Gravell 2009-08-06 11:50:37

Yup, I'm aware of endianness issues, but this is a proprietary application where I have full control over deployment. Regarding BufferedStream, from my understanding the FileStream is already buffered, so it would just add an unnecessary intermediary buffer. P.S.: I'm also using your protobuf library in this project, so many thanks for that :)

legenden 2009-08-06 11:53:56

I just made a new measurement with a wrapping BufferedStream, and as anticipated, there is no difference.

legenden 2009-08-06 12:01:24

Answer 2

+2 A:

When you do a filecopy, large chunks of data are read and written to disk.

You are reading the entire file 4 bytes at a time, which is bound to be slower. Even if the stream implementation is smart enough to buffer, you still make at least 500 MB / 4 = 131,072,000 API calls.

Wouldn't it be wiser to read a large chunk of data, go through it sequentially, and repeat until the whole file has been processed?
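A rough sketch of that idea, assuming the file is nothing but a flat sequence of 32-bit integers in the host's byte order. `ChunkedReader`, `FillBuffer` and `SumInt32s` are hypothetical names, and the 64 KB default chunk size is an arbitrary choice, not a measured optimum.

```csharp
using System;
using System.IO;

static class ChunkedReader
{
    // Fills buffer as far as the stream allows and returns the number of bytes read.
    // Stream.Read may return fewer bytes than requested, hence the loop.
    static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        return total;
    }

    // Sums every Int32 in the stream using one bulk read per chunk
    // instead of one stream call per value.
    public static long SumInt32s(Stream stream, int chunkSize = 1 << 16)
    {
        if (chunkSize <= 0 || chunkSize % 4 != 0)
            throw new ArgumentException("chunkSize must be a positive multiple of 4.", nameof(chunkSize));

        var buffer = new byte[chunkSize];
        long sum = 0;
        int filled;
        while ((filled = FillBuffer(stream, buffer)) > 0)
        {
            if (filled % 4 != 0)
                throw new InvalidDataException("Stream length is not a multiple of 4 bytes.");

            // Decode straight from the in-memory chunk (host byte order).
            for (int i = 0; i < filled; i += 4)
                sum += BitConverter.ToInt32(buffer, i);
        }
        return sum;
    }
}
```

Variable-length records would need extra handling for records that straddle a chunk boundary, which this sketch deliberately leaves out.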

Toad 2009-08-06 11:59:08

There's a parameter in the FileStream constructor which specifies the buffer size, so the read is indeed done in chunks. I tried various values for the buffer size, but there were no major improvements. Extra large buffer sizes actually hurt performance in my measurements.

legenden 2009-08-06 12:03:44

Still, you are making an immense number of calls to 'ReadInt32'. Reading the values yourself from a contiguous block of memory will be much quicker.

Toad 2009-08-06 12:06:44

Please re-read the question, I am not using ReadInt32 in the actual implementation, there is only 1 read per object, and all the conversions are inlined, see the last two blocks of code.

legenden 2009-08-06 12:12:14

Right... sorry about that. I guess the immense number of small memory allocations might be the problem, then.

Toad 2009-08-06 12:25:55

I will mark your answer as accepted because you suggested reading large chunks of data from the file. That would have been redundant if FileStream's own buffering implementation weren't flawed, but apparently it is.

legenden 2009-08-06 12:30:39

Answer 3

+3 A:

Interesting, reading the whole file into a buffer and going through it in memory made a huge difference. This is at the cost of memory, but we have plenty.

This makes me think that the FileStream's (or BufferedStream's for that matter) buffer implementation is flawed, because no matter what size buffer I tried, performance still sucked.

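  using (var br = new FileStream(cacheFilePath, FileMode.Open, FileAccess.Read, FileShare.Read, 0x10000, FileOptions.SequentialScan))
  {
      // Pull the whole file into memory. Read may return fewer bytes than requested, so loop.
      byte[] buffer = new byte[br.Length];
      int offset = 0;
      while (offset < buffer.Length)
      {
          int read = br.Read(buffer, offset, buffer.Length - offset);
          if (read == 0) throw new EndOfStreamException();
          offset += read;
      }
      // Deserialize every object from the in-memory copy.
      using (var memoryStream = new MemoryStream(buffer))
      {
          while (memoryStream.Position < memoryStream.Length)
          {
              var doc = DocumentData.Deserialize(memoryStream);
              docData[doc.InternalId] = doc;
          }
      }
  }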

Down to 2-5 seconds now (depending on the disk cache, I'm guessing) from 22, which is good enough for now.
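For reference, the same whole-file approach can be sketched more compactly with `File.ReadAllBytes`, which takes care of partial reads internally. `WholeFileLoader` and `LoadAll` are hypothetical names, and the `deserialize` delegate merely stands in for something like `DocumentData.Deserialize`. This has not been measured against the numbers above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class WholeFileLoader
{
    // Loads the entire file into memory, then calls deserialize repeatedly
    // until the in-memory stream is exhausted.
    // deserialize must consume at least one byte per call, or the loop never ends.
    public static List<T> LoadAll<T>(string path, Func<Stream, T> deserialize)
    {
        byte[] bytes = File.ReadAllBytes(path);
        var items = new List<T>();
        using (var memoryStream = new MemoryStream(bytes, writable: false))
        {
            while (memoryStream.Position < memoryStream.Length)
            {
                items.Add(deserialize(memoryStream));
            }
        }
        return items;
    }
}
```

The trade-off is the same as above: memory use grows with file size in exchange for a single bulk read.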

legenden 2009-08-06 12:21:53

So my answer wasn't that flawed ;^)

Toad 2009-08-06 12:26:53

Thanks. But there's actually a problem with .NET's buffer implementation, because I tried a buffer size exactly as big as the file (which should have been equivalent to the intermediary MemoryStream), and that still sucked performance-wise. In theory your suggestion should have been redundant, but in practice - jackpot.

legenden 2009-08-06 12:36:10

Faster (unsafe) BinaryReader in .NET