I've never tried this - so I don't know if I'd run into memory issues.

But can a SqlDataReader read a trillion records? It's all streamed, correct? I'm a little green on what the SQL/TDS protocol is doing under the covers.

UPDATE: Take "trillion" to mean a very large number. I probably should have said something like 1 billion or 100 million.

+5  A: 

Yes, that will stream... but I don't think you should actually try to do it.

If you could read a million records per second (which sounds unlikely to me) you'd still need 12 days to read a trillion records... that's a lot of work to risk losing half way through.

Now I realise you probably don't really want to read a trillion records, literally, but my point is that if you can separate your "large amount" of work into logical batches anyway, that's probably a good idea.
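
For illustration, here's a minimal sketch of that kind of batching with plain ADO.NET, assuming a hypothetical table dbo.BigTable with a bigint clustered key Id; the connection string and names are placeholders, not anything from the question:

    using System;
    using System.Data.SqlClient;

    static class BatchedReader
    {
        // Placeholder connection string, table and column names.
        const string ConnectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
        const int BatchSize = 1000;

        public static void Run()
        {
            long lastId = 0;
            bool more = true;
            while (more)
            {
                more = false;
                using (var conn = new SqlConnection(ConnectionString))
                using (var cmd = new SqlCommand(
                    @"SELECT TOP (@batch) Id, Payload
                      FROM dbo.BigTable
                      WHERE Id > @lastId
                      ORDER BY Id", conn))
                {
                    cmd.Parameters.AddWithValue("@batch", BatchSize);
                    cmd.Parameters.AddWithValue("@lastId", lastId);
                    conn.Open();
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            more = true;
                            lastId = reader.GetInt64(0); // remember where this batch ended
                            // per-row processing goes here
                        }
                    }
                }
            }
        }
    }

Each iteration is an independent query, so a failure only loses the batch in flight rather than days of work.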

Jon Skeet
So my original question was going to be: what is the best batching strategy for ADO.NET and SQL Server... what is the best way to tackle 1,000 records at a time? Say you are doing a MapReduce-type activity. I realize there are other tools for this (open source and commercial), but if the company you are working for won't let you use them... they do me no good (except to borrow ideas from).
tyndall
Good point about the 12 days +1. Maybe I picked too high a number.
tyndall
To be honest, the best batching strategy will depend on the exact nature of the task. Can you reliably split it into batches even if the queries are run at a later date? Can you split it into batches ahead of time, and give different computers different batches? Is anything else writing to this data? Are there appropriate indexes? Basically it's a case of working out a way to split your data in a queryable and efficient form.
Jon Skeet
So those kinds of questions are what I was struggling with. People may be writing to the data while I'm in the middle of the process. I don't have a good "snapshot" strategy. That is the one I'm really trying to get my brain around.
tyndall
At first it may be one server (4 cores) working on the data. Maybe two or three servers by end of year. Thinking about using some F# in this project. Seems well suited for it.
tyndall
+1  A: 

Yes - it might take a while (as long as your SQL isn't doing anything silly, like trying to take a snapshot), but if your server can stream it out, the SqlDataReader shouldn't have a memory usage problem.
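
As a rough sanity check of that streaming behaviour, a minimal sketch; the connection string, table and columns are placeholders. The reader only ever holds the current row, so client memory stays flat regardless of how many rows the server sends:

    using System;
    using System.Data.SqlClient;

    class StreamingCount
    {
        static void Main()
        {
            // Placeholder connection string and query.
            const string connectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT Id, Payload FROM dbo.BigTable", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    long count = 0;
                    // Rows arrive over TDS as they are produced; only the current row is buffered client-side.
                    while (reader.Read())
                    {
                        count++;
                    }
                    Console.WriteLine(count);
                }
            }
        }
    }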

Cade Roux
+1  A: 

There are a few details.

  • SqlDataReader will normally read an entire row into memory and cache it. This includes any BLOB fields, so you can end up caching several 2 GB fields in memory (XML, VARBINARY(MAX), VARCHAR(MAX), NVARCHAR(MAX)). If such fields are a concern, you must pass CommandBehavior.SequentialAccess to ExecuteReader and use the streaming capabilities of the SqlClient-specific types, like SqlBytes.Stream (see the first sketch after this list).

  • A connection is busy until the SqlDataReader completes. This creates transactional problems, because while the connection is busy you won't be able to do any other processing in the database in the same transaction. Trying to open a different connection and enlist it in the same transaction will fail, as loopback distributed transactions are prohibited. The solution is to use MARS, by setting MultipleActiveResultSets=True on the connection. This allows you to issue commands on the same connection while a data reader is still active (the typical fetch-process-fetch loop; see the second sketch after this list). Read the link to Christian Kleinerman's post with great care and make sure you understand the issues and restrictions around MARS and transactions; they're quite subtle and counter-intuitive.

  • Lengthy processing in the client will block the server. Your query will still be executing all this time, and the server will have to suspend it when the communication pipe fills up. A query consumes a worker (or more, if it has parallel plans), and workers are a very scarce commodity in a server (they equate roughly to threads). You won't be able to afford many clients processing huge result sets at their own leisure.

  • Transaction size. Processing a trillion records in one transaction is never going to work. The log will have to grow to accommodate the entire transaction and won't truncate and reuse the VLFs, resulting in huge log growth.

  • Recovery time. If processing fails at the 999 billionth record, it will have to roll back all the work done, so it will take another '12' days just to roll back.
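
To illustrate the first point, a rough sketch of reading a large BLOB column with CommandBehavior.SequentialAccess and SqlBytes.Stream, as the answer suggests; dbo.Documents(Id, Content) and the connection string are made-up placeholders:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    class BlobStreaming
    {
        static void Main()
        {
            // Placeholder connection string and table; Content is assumed to be VARBINARY(MAX).
            const string connectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT Id, Content FROM dbo.Documents", conn))
            {
                conn.Open();
                // SequentialAccess: columns must be read in ordinal order, and BLOBs can be streamed.
                using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
                {
                    while (reader.Read())
                    {
                        int id = reader.GetInt32(0);
                        using (Stream blob = reader.GetSqlBytes(1).Stream)
                        using (var file = File.Create("doc_" + id + ".bin"))
                        {
                            blob.CopyTo(file); // copies in chunks; the whole BLOB is never held in memory
                        }
                    }
                }
            }
        }
    }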
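
And for the second point, a sketch of the fetch-process loop on a single MARS-enabled connection; table and column names are again hypothetical:

    using System.Data.SqlClient;

    class MarsFetchProcess
    {
        // MultipleActiveResultSets=True lets a second command execute while the reader is still open.
        const string ConnectionString =
            "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;MultipleActiveResultSets=True";

        static void Main()
        {
            using (var conn = new SqlConnection(ConnectionString))
            {
                conn.Open();
                using (var select = new SqlCommand("SELECT Id FROM dbo.WorkQueue WHERE Processed = 0", conn))
                using (var reader = select.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        long id = reader.GetInt64(0);
                        // Without MARS this would throw: the connection already has an open reader.
                        using (var update = new SqlCommand(
                            "UPDATE dbo.WorkQueue SET Processed = 1 WHERE Id = @id", conn))
                        {
                            update.Parameters.AddWithValue("@id", id);
                            update.ExecuteNonQuery();
                        }
                    }
                }
            }
        }
    }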

Remus Rusanu
Very good information. +1 What role do transactions play in the system if the data only needs to be eventually consistent? What would you suggest is the proper way to batch-process 1,000 or 10,000 records at a time? (see comments to Jon Skeet)
tyndall
The proper way to create batches that can be safely resumed depends on the actual task being done. A trivial example is to have a table holding the 'current' clustered key value. In a transaction, you get the value from the table, select the next 10k rows ordered by the clustered key, process them, update the current key value in the table, and commit. Rinse and repeat (see the sketch below).
Remus Rusanu
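
A rough sketch of the pattern described above: like the earlier batching sketch, but progress is persisted in a one-row table dbo.BatchProgress(LastKey bigint) and each batch is its own transaction, so a failed run simply resumes. Every name here is hypothetical:

    using System.Data.SqlClient;

    class ResumableBatches
    {
        // Placeholder connection string; dbo.BatchProgress holds a single row with the last processed key.
        const string ConnectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
        const int BatchSize = 10000;

        static void Main()
        {
            bool more = true;
            while (more)
            {
                using (var conn = new SqlConnection(ConnectionString))
                {
                    conn.Open();
                    using (var tx = conn.BeginTransaction())
                    {
                        // 1. Read the last processed key.
                        long lastKey;
                        using (var get = new SqlCommand("SELECT LastKey FROM dbo.BatchProgress", conn, tx))
                        {
                            lastKey = (long)get.ExecuteScalar();
                        }

                        // 2. Fetch and process the next batch, ordered by the clustered key.
                        long newLast = lastKey;
                        int rows = 0;
                        using (var select = new SqlCommand(
                            @"SELECT TOP (@batch) Id, Payload
                              FROM dbo.BigTable
                              WHERE Id > @lastKey
                              ORDER BY Id", conn, tx))
                        {
                            select.Parameters.AddWithValue("@batch", BatchSize);
                            select.Parameters.AddWithValue("@lastKey", lastKey);
                            using (var reader = select.ExecuteReader())
                            {
                                while (reader.Read())
                                {
                                    rows++;
                                    newLast = reader.GetInt64(0);
                                    // per-row processing goes here
                                }
                            }
                        }

                        // 3. Persist progress and commit; after a failure, work resumes from LastKey.
                        using (var save = new SqlCommand(
                            "UPDATE dbo.BatchProgress SET LastKey = @key", conn, tx))
                        {
                            save.Parameters.AddWithValue("@key", newLast);
                            save.ExecuteNonQuery();
                        }

                        tx.Commit();
                        more = rows > 0;
                    }
                }
            }
        }
    }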