views: 56
answers: 3

I want to add new columns to a table which already contains billions of rows. The new columns are derived from existing columns.

For example,

new_col1 = old_col1 + old_col2
new_col2 = old_col1 / old_col2

I am trying to do this in the following way:

Add new columns

ALTER TABLE table_name
ADD ( column_1    column_definition,
      column_2    column_definition,
      ...
      column_n    column_definition )

Read rows one by one from the table and fill in the values for the new columns.

There is no primary key in the table, so I cannot refer to an individual row. To read the rows one by one, I would have to do a select *, which would give a huge result set (considering the billions of records).

Is there any better way to do this?

A: 

Use a stored procedure that updates rows 100 at a time, and add the stored procedure as a job that runs every 30 seconds or so.
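
A rough sketch of that idea, assuming SQL Server syntax (the procedure name is made up, and NULL in the new column is used to mark rows that have not been processed yet):

CREATE PROCEDURE dbo.fill_new_columns
AS
BEGIN
    -- Fill in 100 not-yet-processed rows per run; NULL marks
    -- unprocessed rows, so no primary key is needed to track progress.
    UPDATE TOP (100) table_name
    SET new_col1 = old_col1 + old_col2,
        new_col2 = old_col1 / old_col2
    WHERE new_col1 IS NULL;
END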

Flakron Bytyqi
There is no primary key, so how will he know which hundred are already updated?
TheVillageIdiot
UPDATE ... LIMIT 100, just supposing this is MySQL.
Flakron Bytyqi
What do you mean by "do an update by 100 of them"? Is there any way to do selects in batches of 100 records? Something like this: select * from ... range 1 to 100; select * from ... range 101 to 200; select * from ... range 201 to 300...
Prashant
Yeah. There is no primary key, and LIMIT 100 would return a random set of rows. I simply want to scan through the table and update the new column values as I go.
Prashant
There's no reason to wait 30 seconds. It looks like the goal is just to save memory, not to preserve the responsiveness of the database during the update.
pascal
Just as an example, nothing else: an update of 100 records can happen in 2 seconds at most (depending on the data) if hosted on good hardware.
Flakron Bytyqi
+1  A: 

Different DBMSs have different SQL dialects, so it is useful to specify which one you are using in the question.

In SQL Server you could use a computed column, but this would recalculate the result every time you select the data. You could flag it as PERSISTED, though it may take a while to make that change, and you can't use this approach at all if you are going to remove the old columns afterwards.
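
For example (just a sketch using the asker's column names; note that persisting old_col1 / old_col2 will fail if any row has old_col2 = 0):

ALTER TABLE table_name
ADD new_col1 AS (old_col1 + old_col2) PERSISTED,
    new_col2 AS (old_col1 / old_col2) PERSISTED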

Alternatively, create the new column allowing NULLs and then update it in batches:

UPDATE TOP (1000) table_name SET new_col1 = old_col1 + old_col2 WHERE new_col1 IS NULL

Again, the query is for SQL Server, but there will be alternatives for your DBMS.
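
If you would rather run it to completion in one script than as a scheduled job, the same statement can be looped until nothing is left (still SQL Server syntax, just a sketch):

WHILE 1 = 1
BEGIN
    -- Each pass fills in another 1000 rows; NULL marks the rows
    -- that still need processing.
    UPDATE TOP (1000) table_name
    SET new_col1 = old_col1 + old_col2,
        new_col2 = old_col1 / old_col2
    WHERE new_col1 IS NULL;

    IF @@ROWCOUNT = 0 BREAK;  -- no rows left to update
END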

Also read Mr Hooper's answer below about adding an index to the new column, to make sure that the performance of the UPDATE doesn't degrade as more data is processed. The update is a read and a write operation; the index will speed up the reads and slightly delay the writes (because the index has to be maintained), but it should be worthwhile.

Chris Diver
+1  A: 

I think Mr Diver's method would be fine if you also added an index on one of your new columns; otherwise, as the job progresses, it will have to scan more and more rows to find the ones it hasn't already updated. Adding an index means it doesn't have to do that. A possible drawback is that the index's selectivity will be frightful when the column is first created (every row is NULL), but I don't think that is a problem here, as you only care about NULL versus NOT NULL. You could drop the index when the update is complete.
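
Something like this, in SQL Server syntax (the index name is made up):

-- Speeds up finding the rows that still need updating.
CREATE INDEX ix_table_name_new_col1 ON table_name (new_col1);

-- ... run the batched update to completion ...

-- The index has served its purpose once every row is filled in.
DROP INDEX ix_table_name_new_col1 ON table_name;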

Brian Hooper