tags:
views: 66
answers: 4

I need to query a database for 12 million rows, process that data, and then insert the filtered data into another database.

I can't just do a SELECT * from the database for obvious reasons - far too much data would be returned for my program to handle, and also this is a live database (customer order details) and I can't have the database grind to a halt for 10 minutes while it runs my query.

I'm looking for inspiration on how to write this program. I have to process each row. I was thinking it might be best to get a count of the rows, then grab X rows at a time, wait for Y seconds, and repeat until the dataset is complete. This way I'm not overloading the database, and since X will be sufficiently small, the batches will fit nicely in memory.
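For illustration, here is a rough sketch of that batching idea in MySQL-style SQL; the orders table, the id column, and the batch size of 10000 are all made up:

SELECT COUNT(*) FROM orders;   -- total number of rows to work through

SELECT * FROM orders ORDER BY id LIMIT 10000 OFFSET 0;
-- ... process this batch and insert the filtered rows into the other database ...
-- sleep Y seconds, advance OFFSET by 10000, and repeat until all rows are covered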

Other suggestions or feedback?

+4  A: 

I'd recommend you read the documentation for SELECT ... INTO OUTFILE and LOAD DATA INFILE.

These are very fast ways of dumping data to a flat file and then importing it to another database.

You could dump into the flat file, run an offline script to process your rows, and then, once that's done, import the result into the new database.
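Roughly, assuming MySQL (which is where these statements come from), with made-up table names and file paths:

SELECT * INTO OUTFILE '/tmp/orders_dump.csv'
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
FROM orders;

-- after processing the flat file offline, on the target database:
LOAD DATA INFILE '/tmp/orders_filtered.csv'
INTO TABLE filtered_orders
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n';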


Bill Karwin
Beat me to it. +1. This is the right answer... get the data into a file as quick as possible, preferably on some disks other than what your live database uses. Then, process the file.
mattmc3
+1  A: 

Spreading the load over time seems the only practicable solution. Exactly how to do it depends to some extent on your schema, how records change over time in the "live database", and what consistency semantics your processing must have.

In the worst case (any record can be changed at any time, there is nothing in the schema that lets you easily and speedily check for recently modified, inserted, or deleted records, and you nevertheless need to be consistent in what you process), the task is simply unfeasible unless you can count on some special support from your relational engine and/or OS. For example, volume or filesystem "snapshots", like those in Linux's LVM, let you cheaply and speedily "freeze in time" a copy of the volumes on which the DB resides, for later leisurely fetching with another, read-only database configured to read from the snapshot volume.
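(As an aside: when the schema does make that check easy, for instance via a last-modified timestamp column, the incremental fetch is cheap. The table, column, and variable names below are hypothetical.)

-- fetch only rows touched since the previous pass
SELECT *
FROM orders
WHERE last_modified > @last_run_time;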

But presumably you do have some constraints, something in the schema that helps with the issue; or else, one can hope, you can afford some inconsistency generated by changes in the DB happening at the same time as your processing: some rows processed twice, some not processed at all, some processed in older versions and others in newer versions. Unfortunately, you have told us next to nothing about any of these issues, which makes it essentially unfeasible to offer much more help. If you edit your question to provide a LOT more information on platform, schema, and DB usage patterns, maybe more help can be offered.

Alex Martelli
A: 

You don't mention which db you are using, but I doubt any db that can hold 12 million rows would actually try to return all the data to your program at once. Your program essentially streams the data in small blocks (say, 1,000 rows), something that is usually handled by the database driver.

RDBMSs offer different transaction isolation levels, which can be used to reduce the effort the database spends maintaining consistency guarantees and so avoid locking up the table.
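For example, in MySQL a batch job that can tolerate dirty reads might relax isolation for its session before running the big SELECT (other engines have their own equivalents):

SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- ... run the SELECT for the next batch ...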

Databases can also dump snapshots of tables to a file for later analysis.

In your position, I would try the simplest thing first and see how that scales (on a development copy of the db, with simulated user access).

mdma
A: 

Either a flat file or a snapshot would be ideal.

If a flat file does not suit, or you do not have access to snapshots, then you could use a sequential id field, or create a sequential id in a temp table, and iterate using that.

Something like this (T-SQL shown; adapt to your engine):

declare @max_id int = 0
declare @batch_size int = 10000   -- pick a size that fits comfortably in memory
while exists (select 1 from your_table where seq_id > @max_id)
begin
    select top (@batch_size) * into #batch
    from your_table
    where seq_id > @max_id
    order by seq_id
    -- ... process the rows in #batch ...
    select @max_id = max(seq_id) from #batch
    drop table #batch
end

If there is no sequential id then you can create a temp table that holds the order like

insert into some_temp_table (unique_id)
select unique_id from your_table order by your_ordering_scheme

then process like this

select top (n) t.* from your_table t join some_temp_table s on t.unique_id = s.unique_id
-- ... process the batch ...
delete top (n) from some_temp_table

This way some_temp_table holds the record identifiers that still need to be processed.

marshall