I'm building a toy database in C# to learn more about compiler, optimizer, and indexing technology.
I want to maintain maximum parallelism between (at least read) requests for bringing pages into the buffer pool, but I am confused about how best to accomplish this in .NET.
Here are some options and the problems I've come across with each:
1. Use System.IO.FileStream and the BeginRead method.
   But the position in the file isn't an argument to BeginRead; it is a property of the FileStream (set via the Seek method), so I can only issue one request at a time and have to lock the stream for the duration. (Or do I? The documentation is unclear on what would happen if I held the lock only between the Seek and BeginRead calls but released it before calling EndRead. Does anyone know?) I know how to do this, I'm just not sure it is the best way.
2. There seems to be another way, centered around the System.Threading.Overlapped structure and P/Invoke to the ReadFileEx function in kernel32.dll.
   Unfortunately, there is a dearth of samples, especially in managed languages. This route (if it can be made to work at all) apparently also involves the ThreadPool.BindHandle method and the IO completion threads in the thread pool. I get the impression that this is the sanctioned way of dealing with this scenario under Windows, but I don't understand it and I can't find an entry point to the documentation that is helpful to the uninitiated.
3. Something else?
   In a comment, jacob suggests creating a new FileStream for each read in flight.
4. Read the whole file into memory.
   This would work if the database were small. The codebase is small, and there are plenty of other inefficiencies, but the database itself isn't. I also want to be sure I am doing all the bookkeeping needed to deal with a large database (which turns out to be a huge part of the complexity: paging, external sorting, ...) and I'm worried it might be too easy to accidentally cheat.
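To make option 3 concrete, here is a minimal sketch of what "a new FileStream for each read in flight" could look like. It is only an illustration under some assumptions: the PageReader class, the ReadPageAsync method, and the 8192-byte page size are hypothetical names and values, and it uses ReadAsync with async/await rather than the BeginRead/EndRead pair. Because every read owns its own stream, and therefore its own file position, no lock is needed around the Seek.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class PageReader
{
    // Hypothetical page size, chosen only for illustration.
    public const int PageSize = 8192;

    // Reads one page through its own FileStream, so concurrent reads
    // never share a file position and need no lock.
    public static async Task<byte[]> ReadPageAsync(string path, long pageNumber)
    {
        var buffer = new byte[PageSize];

        // FileShare.Read lets several readers open the file at once; a writer
        // holding the file open elsewhere would need FileShare.ReadWrite instead.
        using (var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read,
            PageSize, FileOptions.Asynchronous | FileOptions.RandomAccess))
        {
            stream.Seek(pageNumber * PageSize, SeekOrigin.Begin);

            // ReadAsync may return fewer bytes than requested, so loop
            // until the page is full or the end of the file is reached.
            int total = 0;
            while (total < PageSize)
            {
                int n = await stream.ReadAsync(buffer, total, PageSize - total);
                if (n == 0) break; // end of file: the rest of the page stays zeroed
                total += n;
            }
        }

        return buffer;
    }
}
```

Several reads can then be started without waiting on one another, for example `await Task.WhenAll(PageReader.ReadPageAsync(path, 7), PageReader.ReadPageAsync(path, 42))`. The obvious cost is one open/close of the file per read, which may or may not matter for a buffer pool.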
Edit
Clarification of why I'm suspicious of solution 1: holding a single lock all the way from BeginRead to EndRead means I have to block anyone who wants to initiate a read just because another read is in progress. That feels wrong, because the thread initiating the new read might (in general) be able to do some more work before the results become available. (Actually, just writing this has led me to think up a new solution, which I've posted as a new answer.)