views: 507

answers: 4

I have a large table, 1B+ records, that I need to pull down and run an algorithm on every record. How can I use ADO.NET to execute a "select * from table" asynchronously and start reading the rows one by one while ADO.NET is still receiving the data?

I also need to dispose of the records after I read them to save on memory. So I am looking for a way to pull a table down record by record and basically push each record into a queue for processing.

My data sources are Oracle and MSSQL. I have to do this for several data sources.

+1  A: 

A combination of a DataReader and an iterator block (a.k.a. generator) should be a good fit for this problem. The default DataReaders provided by Microsoft pull data one record at a time from a datasource.

Here's an example in C#:

static IEnumerable<User> RetrieveUsers(DbDataReader reader)
{
    // Read() advances to the next row; NextResult() would skip to the next
    // result set, which is not what we want here.
    while (reader.Read())
    {
        User user = new User
        {
            Name = reader.GetString(0),
            Surname = reader.GetString(1)
        };
        yield return user;
    }
}
Richard Nienaber
I don't think that DbDataReaders pull data one record at a time from a database. That would require far too many round trips. They fetch a few hundred rows at a time by default (unless the records are very big).
tuinstoel
Added reference which states that the default implementation retrieves one row at a time.
Richard Nienaber
The OracleDataReader fetches blocks of 64 KB by default, so it will retrieve more than one row unless the records are very big. Read here: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html
tuinstoel
You're right, I forgot DataReaders can be provided by vendors other than Microsoft. Made my assertion more clear.
Richard Nienaber
A: 

A good approach to this would be to pull back the data in blocks, iterate through them adding to your queue, then call again for the next block. This is going to be better than hitting the DB for each row. If you are pulling them back via a numeric PK this will be easy; if you need to order by something else you can use ROW_NUMBER() to do it (see the sketch below).
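A minimal sketch of the numeric-PK variant of this idea (the table name "Records", its columns "Id"/"Payload", and the process delegate are assumptions, not from the question):

using System;
using System.Data.SqlClient;

static class BlockReader
{
    public static void ProcessInBlocks(SqlConnection conn, int batchSize, Action<string> process)
    {
        // Keyset paging: ask for the next block of rows after the last key we saw.
        const string pageSql =
            "SELECT TOP (@batchSize) Id, Payload FROM Records " +
            "WHERE Id > @lastId ORDER BY Id";

        long lastId = 0;
        while (true)
        {
            int rowsInBatch = 0;
            using (SqlCommand cmd = new SqlCommand(pageSql, conn))
            {
                cmd.Parameters.AddWithValue("@batchSize", batchSize);
                cmd.Parameters.AddWithValue("@lastId", lastId);
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        lastId = reader.GetInt64(0);      // remember where we are
                        process(reader.GetString(1));     // run the algorithm / enqueue
                        rowsInBatch++;
                    }
                }
            }
            if (rowsInBatch < batchSize) break;           // last (possibly empty) block
        }
    }
}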

Rob West
+1  A: 

You should use SSIS for this.

You need a bit of background on how the ADO.NET data providers work to understand what you can and cannot do. Let's take the SqlClient provider as an example. It is true that it is possible to execute queries asynchronously with BeginExecuteReader, but this asynchronous execution only lasts until the query starts returning results. At the wire level the SQL text is sent to the server, the server starts churning on the query execution, and eventually it starts pushing result rows back to the client. As soon as the first packet comes back to the client, the asynchronous execution is done and the completion callback is executed. After that the client uses the SqlDataReader.Read() method to advance the result set; there are no asynchronous methods on the SqlDataReader.

This pattern works wonders for complex queries that return few results after some serious processing: while the server is busy producing the result, the client is idle with no threads blocked. However, things are completely different for simple queries that produce large result sets (as seems to be the case for you): the server will immediately produce results and will continue to push them back to the client. The asynchronous callback will fire almost instantaneously, and the bulk of the time will be spent by the client iterating over the SqlDataReader.
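A minimal sketch of that BeginExecuteReader pattern (the connection string and query are placeholders; on the .NET versions of that era the connection string also needs "Asynchronous Processing=true" for the Begin/End methods to work):

using System;
using System.Data.SqlClient;

static class AsyncReadSketch
{
    public static void Run(string connectionString)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("select * from BigTable", conn)) // placeholder query
        {
            conn.Open();

            // The asynchronous part only covers the time until the first
            // result packet arrives from the server.
            IAsyncResult asyncResult = cmd.BeginExecuteReader();

            // ... the calling thread is free to do other work here ...

            using (SqlDataReader reader = cmd.EndExecuteReader(asyncResult))
            {
                // From here on iteration is synchronous: Read() blocks while it
                // waits for the next rows to come off the wire.
                while (reader.Read())
                {
                    // process the current row
                }
            }
        }
    }
}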

You say you're thinking of placing the records into an in-memory queue first. What is the purpose of the queue? If your algorithm's processing is slower than the throughput of the DataReader result set iteration, then this queue will start to build up. It will consume live memory and eventually will exhaust the memory on the client. To prevent this you would have to build in a flow control mechanism, i.e. if the queue size is bigger than N, don't put any more records into it. But to achieve this you would have to suspend the data reader iteration, and if you do this you push flow control to the server, which will suspend the query until the communication pipe is available again (until you start reading from the reader again).

Ultimately the flow control has to be propagated all the way to the server, as is always the case in any producer-consumer relation: the producer has to stop, otherwise intermediate queues fill up. Your in-memory queue serves no purpose at all, other than complicating things. You can simply process items from the reader one by one, and if your rate of processing is too slow, the data reader will cause flow control to be applied to the query running on the server. This happens automatically, simply because you stop calling the DataReader.Read method.

To summarise: for large result set processing you cannot do asynchronous processing, and there is no need for a queue.

Now the difficult part.

Is your processing doing any sort of update back in the database? If yes, then you have much bigger problems:

  • You cannot use the same connection to write back the results, because it is busy with the data reader. SqlClient for SQL Server supports MARS, but that only solves the problem on SQL 2005/2008.
  • If you're going to enlist the read and the update in a transaction and your updates occur on a different connection (see above), then that means using distributed transactions (even when the two connections involved point back to the same server). Distributed transactions are slow.
  • You will need to split the processing into several batches, because it is very bad to process 1B+ records in a single transaction. This also means that you have to be able to resume processing of an aborted batch, which means you must be able to identify records that were already processed (unless processing is idempotent). See the sketch after this list.
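A rough sketch of that batch-and-checkpoint idea (the Checkpoint table, the batch helper, and the key names are assumptions; adapt the resume logic to your schema):

using System;
using System.Data.SqlClient;

static class BatchCheckpointSketch
{
    // Assumed helper table: Checkpoint(LastProcessedId bigint), containing one row.
    public static void ProcessBatches(string connectionString, int batchSize)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();
            while (true)
            {
                using (SqlTransaction tx = conn.BeginTransaction())
                {
                    // Resume from the last committed checkpoint, so an aborted
                    // batch can simply be re-run.
                    long lastId;
                    using (SqlCommand cmd = new SqlCommand(
                        "SELECT LastProcessedId FROM Checkpoint", conn, tx))
                    {
                        lastId = (long)cmd.ExecuteScalar();
                    }

                    // Read one batch fully, close the reader, then write the results
                    // back on the same connection/transaction: no MARS and no
                    // distributed transaction required.
                    long newLastId = ProcessOneBatch(conn, tx, lastId, batchSize);
                    if (newLastId == lastId)
                    {
                        tx.Rollback();
                        break; // nothing left to process
                    }

                    using (SqlCommand cmd = new SqlCommand(
                        "UPDATE Checkpoint SET LastProcessedId = @id", conn, tx))
                    {
                        cmd.Parameters.AddWithValue("@id", newLastId);
                        cmd.ExecuteNonQuery();
                    }
                    tx.Commit();
                }
            }
        }
    }

    // Placeholder: read at most batchSize rows with Id > lastId, run the algorithm,
    // apply the updates, and return the highest Id processed (or lastId if none).
    static long ProcessOneBatch(SqlConnection conn, SqlTransaction tx,
                                long lastId, int batchSize)
    {
        // ... SELECT TOP (@batchSize) ... WHERE Id > @lastId ORDER BY Id ...
        return lastId;
    }
}
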
Remus Rusanu
A: 

Just use the DbDataReader (just like damagednoob said). It is a forward-only way of scrolling through the retrieved data. You don't have to dispose of the records yourself because a DbDataReader is forward-only.

When you use the DbDataReader it seems that the records are retrieved one by one from the database.

It is however slightly more complicated:

Oracle (and probably MySQL) will fetch a few hundred rows at a time to decrease the number of round trips to the database. You can configure the fetch size of the DataReader. Most of the time it will not matter whether you fetch 100 rows or 1000 rows per round trip. However, a very low value like 1 or 2 rows slows things down, because with such a low value retrieving the data will require too many round trips.

You probably don't have to set the fetch size manually; the default will be just fine.
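If you do want to control it, here is a sketch with ODP.NET (the query is a placeholder; FetchSize is expressed in bytes, so the usual trick, per the Oracle article linked below, is RowSize times the number of rows you want per round trip):

using Oracle.DataAccess.Client; // ODP.NET provider

static class FetchSizeSketch
{
    public static void Read(string connectionString)
    {
        using (OracleConnection conn = new OracleConnection(connectionString))
        using (OracleCommand cmd = new OracleCommand("select * from big_table", conn)) // placeholder query
        {
            conn.Open();
            using (OracleDataReader reader = cmd.ExecuteReader())
            {
                // FetchSize is in bytes; RowSize is the size of one row, so this
                // asks for roughly 1000 rows per round trip.
                reader.FetchSize = cmd.RowSize * 1000;

                while (reader.Read())
                {
                    // process the current row
                }
            }
        }
    }
}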

edit1: See here for an Oracle example: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html

tuinstoel