I need to do a lot of processing on a table that has 26+ million rows:
- Determine the correct size of each column based on that column's data
- Identify and remove duplicate rows.
- Create a primary key (auto-incrementing id)
- Create a natural key (unique constraint)
- Add and remove columns
Please share your tips on how to speed this process up, and the order in which you would do the items above.
Thanks so much.
UPDATE: I don't need to worry about concurrent users. There are no indexes on this table; it was loaded from a source file. When all is said and done, there will be indexes.
UPDATE: If you would take a different approach from the list above, please feel free to mention it.
Based on the comments so far, here is what I found works:
1. Create a subset of the 26+ million rows to work against; I found that 500,000 rows works well (first sketch below).
2. Delete columns that won't be used (if any).
3. Set appropriate datatype lengths for all columns in one scan using MAX(LEN()) (the second sketch below covers steps 2 and 3).
4. Create a (unique, if possible) clustered index on the column or columns that will eventually be the natural key (third sketch below).
5. Repeat steps 2-4 on all 26+ million rows.
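For reference, here are rough T-SQL sketches of these steps. All table and column names (dbo.BigTable, dbo.BigTable_Sample, FirstName, SourceSystem, and so on) are placeholders I'm using for illustration, not the real schema. First sketch, the 500,000-row working sample (step 1):

```sql
-- Sketch 1: copy a 500,000-row sample into a scratch table to experiment on.
-- Table names are hypothetical.
SELECT TOP (500000) *
INTO   dbo.BigTable_Sample
FROM   dbo.BigTable;
```

TOP without an ORDER BY just grabs whatever rows come back first, which is fine for sizing experiments and much cheaper than a sorted or random sample.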
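Second sketch, covering steps 2 and 3: drop the unused columns, then run one scan with MAX(LEN()) to find the longest value in each character column, followed by ALTER COLUMN to shrink them. The column names and target lengths below are only examples.

```sql
-- Sketch 2: drop unused columns, then size the remaining character columns.
ALTER TABLE dbo.BigTable_Sample DROP COLUMN UnusedCol1, UnusedCol2;

-- One scan reports the longest value found in each character column.
SELECT MAX(LEN(FirstName))  AS FirstName_MaxLen,
       MAX(LEN(LastName))   AS LastName_MaxLen,
       MAX(LEN(PostalCode)) AS PostalCode_MaxLen
FROM   dbo.BigTable_Sample;

-- Shrink each column to the observed length plus some headroom;
-- restate nullability explicitly so ALTER COLUMN doesn't change it.
ALTER TABLE dbo.BigTable_Sample ALTER COLUMN FirstName  varchar(50) NULL;
ALTER TABLE dbo.BigTable_Sample ALTER COLUMN PostalCode varchar(10) NULL;
```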
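Third sketch, covering step 4 plus the duplicate removal and primary key from my original list: duplicates on the natural-key columns have to be deleted before the unique index will build, and ROW_NUMBER() is one common way to keep a single row per key. Again, the key columns here are placeholders.

```sql
-- Sketch 3: keep one row per natural key, then enforce the key.
WITH Ranked AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY SourceSystem, SourceId
                              ORDER BY (SELECT NULL)) AS rn
    FROM   dbo.BigTable_Sample
)
DELETE FROM Ranked
WHERE  rn > 1;

-- With duplicates gone, the natural key can be a unique clustered index.
CREATE UNIQUE CLUSTERED INDEX UX_BigTable_NaturalKey
    ON dbo.BigTable_Sample (SourceSystem, SourceId);

-- Auto-incrementing surrogate key as a nonclustered primary key,
-- since the clustered slot is already used by the natural key.
ALTER TABLE dbo.BigTable_Sample
    ADD Id int IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_BigTable PRIMARY KEY NONCLUSTERED;
```

Adding the IDENTITY column is a size-of-data operation, so on the full 26+ million rows it will take a while; I'd leave it, and any remaining nonclustered indexes, until the very end.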