views: 111

answers: 2
I need to do a lot of processing on a table that has 26+ million rows:

  1. Determine the correct size of each column based on that column's data
  2. Identify and remove duplicate rows
  3. Create a primary key (auto-incrementing id)
  4. Create a natural key (unique constraint)
  5. Add and remove columns

Please list your tips for speeding this process up, and the order in which you would perform the steps above.

Thanks so much.

UPDATE: No need to worry about concurrent users. Also, there are no indexes on this table; it was loaded from a source file. When all is said and done, there will be indexes.

UPDATE: If you would use a different set of steps from the one I listed, please feel free to mention it.

Based on comments so far and what I have found works (see the T-SQL sketch after this list):

  1. Create a subset of rows from the 26+ million. I found that 500,000 rows works well.
  2. Delete columns that won't be used (if any)
  3. Set appropriate datatype lengths for all columns in one scan using max(len())
  4. Create a (unique if possible) clustered index on the column(s) that will eventually be the natural key
  5. Repeat steps 2-4 on all the rows
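
A minimal sketch of steps 1, 3, and 4 in T-SQL, assuming a staging table dbo.BigTable with made-up column names (all names here are hypothetical):

    -- 1: copy a 500,000-row sample into a work table
    SELECT TOP (500000) *
    INTO dbo.BigTable_Sample
    FROM dbo.BigTable;

    -- 3: find the longest value in each character column in a single scan
    SELECT MAX(LEN(col1)) AS col1_max_len,
           MAX(LEN(col2)) AS col2_max_len
    FROM dbo.BigTable_Sample;

    -- 4: cluster on the future natural-key column(s);
    -- keep UNIQUE only if the duplicates have already been removed
    CREATE UNIQUE CLUSTERED INDEX IX_Sample_NaturalKey
        ON dbo.BigTable_Sample (key_col1, key_col2);
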
+2  A: 

If you are going to remove some columns, you should probably do that first if possible. This will reduce the amount of data you have to read for the other operations.

Bear in mind that when you modify data, this may also require modifying any indexes that include that data. It is therefore often a good idea to remove the indexes if you plan to make a large number of updates to the table, and then add them again afterwards.
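
For example, a rough sketch (the index and table names here are made up):

    -- drop the index before the bulk updates
    DROP INDEX IX_BigTable_SomeCol ON dbo.BigTable;

    -- ... run the large updates here ...

    -- then recreate it afterwards
    CREATE INDEX IX_BigTable_SomeCol ON dbo.BigTable (some_col);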

Mark Byers
It won't reduce any I/O; deleting a column is just a metadata operation in SQL Server.
Martin Smith
@Martin Smith: Step 1: There is no point knowing the correct size of a column if you are going to delete it anyway, so you save time in this step by just deleting that column. Step 2: It will also not need to be read when he compares rows to see if they are duplicates - saving time here too.
Mark Byers
@Mark - That could equally be achieved by simply not doing these steps for columns that are destined for deletion. But I guess it doesn't make any difference really.
Martin Smith
@Martin Smith: I'd say it's simpler to delete them at the start than to keep remembering to skip them. If you skip the columns and then delete them at the end, it just adds unnecessary complication to the process without any benefit.
Mark Byers
@Mark - Yep I agree actually. I can see no downside to deleting them straight away and it would definitely need to be done before adding the clustered index anyway.
Martin Smith
A: 

Order: 5, 2, 1, 3, 4

1: No way around it: Select Max(Len(...)) From ...

2: That all depends on what you consider a duplicate.
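
If a duplicate means that every defining column matches, one common approach on SQL Server 2005+ is ROW_NUMBER() partitioned by those columns (the column names below are hypothetical):

    -- keep one row per duplicate group and delete the rest
    WITH Dupes AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY col1, col2, col3  -- the columns that define "duplicate"
                   ORDER BY (SELECT NULL)         -- arbitrary pick of the surviving row
               ) AS rn
        FROM dbo.BigTable
    )
    DELETE FROM Dupes
    WHERE rn > 1;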

3: ALTER TABLE in Books Online will tell you how. No way to speed this up, really.

4: See 3.

5: See 3.
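
For reference, the relevant ALTER TABLE forms for 3-5 look roughly like this (all names are hypothetical):

    -- 3: add an auto-incrementing primary key
    ALTER TABLE dbo.BigTable ADD id INT IDENTITY(1, 1) NOT NULL;
    ALTER TABLE dbo.BigTable ADD CONSTRAINT PK_BigTable PRIMARY KEY (id);

    -- 4: enforce the natural key with a unique constraint
    ALTER TABLE dbo.BigTable ADD CONSTRAINT UQ_BigTable_NaturalKey UNIQUE (key_col1, key_col2);

    -- 5: add and remove columns
    ALTER TABLE dbo.BigTable ADD new_col VARCHAR(50) NULL;
    ALTER TABLE dbo.BigTable DROP COLUMN old_col;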

Stu