
I am trying to take one step towards optimizing a 90GB+ table:

Old Table

Every day approx. 750,000 records are grabbed from an external source and added to the table with the new date. This has been going on for three years, from what I understand. 97% of the records don't change from one day to the next.

New Table

I am trying to go through the old table (millions and millions of records) and eliminate the redundancy, which should reduce the table size quite dramatically.

old_table

  • date
  • record_id
  • data_field (really many fields, but simplified to one for the sake of the example)

new_table_index

  • date
  • index_id

new_table

  • index_id
  • record_id
  • data_field
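
For concreteness, here is a sketch of how the two new tables might be declared; the column types and the extra key are my guesses for illustration, not part of the original schema:

    -- Hypothetical DDL; data_field stands in for the many real fields.
    CREATE TABLE new_table (
        index_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        record_id  INT UNSIGNED NOT NULL,
        data_field VARCHAR(255) NOT NULL,
        KEY idx_record (record_id, index_id)  -- fast "latest version of a record" lookups
    );

    CREATE TABLE new_table_index (
        date     DATE NOT NULL,
        index_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (date, index_id)
    );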

Logic as we go through each record in old_table:

    if (record_id is not in new_table)
       or (record_id is in new_table, but its latest entry has a different data_field):
        insert the record into new_table and get the new index_id
    else:
        look up the latest index_id for that record_id (via new_table joined to new_table_index)
    always:
        insert that index_id and the date into new_table_index
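
To make the "latest entry" check concrete, the lookup might look like this (a sketch, assuming index_id is AUTO_INCREMENT so the largest index_id for a record_id is its most recent version; @record_id is illustrative):

    -- Hypothetical lookup of a record's most recent version:
    SELECT index_id, data_field
    FROM new_table
    WHERE record_id = @record_id
    ORDER BY index_id DESC
    LIMIT 1;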

Any thoughts on the optimal way to do this? I am not advanced enough with MySQL to put this all together. When I tried writing a script in PHP, it used up 3GB of memory and then failed. Any other suggestions or queries? Thanks so much!

A: 

You could add a column to the table that stores the LastModified time, then an ON INSERT / ON UPDATE trigger to set that value to the current time. Your data porting process could then simply grab the records whose LastModified is greater than the time of your last data port.

If you index this new field, it should be a lot faster than comparing all the data_field values.

If you don't need hourly granularity on these checks, you can make it a DATE column instead of a DATETIME. The column will be smaller, so more of them will stay in memory and your WHERE filter will run faster.
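
A minimal sketch of the idea (the column, index, and trigger names are illustrative, as is @last_port_time; note that a TIMESTAMP column with ON UPDATE CURRENT_TIMESTAMP can achieve the same thing without triggers):

    -- Hypothetical: track the modification time on old_table.
    ALTER TABLE old_table ADD COLUMN last_modified DATETIME;
    CREATE INDEX idx_last_modified ON old_table (last_modified);

    CREATE TRIGGER old_table_bi BEFORE INSERT ON old_table
    FOR EACH ROW SET NEW.last_modified = NOW();

    CREATE TRIGGER old_table_bu BEFORE UPDATE ON old_table
    FOR EACH ROW SET NEW.last_modified = NOW();

    -- The porting job then only reads rows changed since its last run:
    SELECT * FROM old_table WHERE last_modified > @last_port_time;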

Chris
Clever idea. Unfortunately, the old_table is already created and in place with three years of records, so adding a Modified field to it would be too late. This is a one-time transfer to get all the data over to the new tables.
Joshua
A: 

First of all, I don't think there's any need to create two new tables. If you need an index, well, that's what MySQL indexes are for: just create one new table and put an index on its date field.

A simple script should do it (assuming an AUTO_INCREMENT index_id on new_table):

INSERT INTO new_table (date, record_id, data_field)
  SELECT
    MIN(date),    -- keep the earliest date each distinct value appeared
    record_id,
    data_field
  FROM
    old_table
  GROUP BY
    record_id,    -- group per record as well, so two records that happen
    data_field;   -- to share the same data_field value are not merged

Before doing it, you might consider creating an index on the data_field columns involved; that should make it much faster.
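
For example (hypothetical; if data_field is a long string type, MySQL needs a prefix length on the index):

    CREATE INDEX idx_dedup ON old_table (record_id, data_field(100));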

Seb
Sorry, let me clarify. The reason for the new_table_index is to still have an entry for every record on every date (some of our queries need that). Thanks for your script; however, it doesn't really take into account the logic I wrote down. Any suggestions on that? Thanks!
Joshua
+4  A: 

You could use this:

new_table
    * date
    * record_id (pk)
    * data_field


INSERT INTO new_table (date, record_id, data_field)
    SELECT date, record_id, data_field FROM old_table
        ON DUPLICATE KEY UPDATE date=old_table.date, data_field=old_table.data_field;

record_id is the primary key, and this same insert could be run again right after each daily insert into old_table.
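
For the ongoing load, that might look like the following (a sketch; it assumes the freshly loaded rows carry today's date):

    -- Hypothetical daily job, run right after the external load:
    INSERT INTO new_table (date, record_id, data_field)
        SELECT date, record_id, data_field
        FROM old_table
        WHERE date = CURDATE()
    ON DUPLICATE KEY UPDATE
        date = VALUES(date),
        data_field = VALUES(data_field);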

For the upsert syntax, see the MySQL documentation for INSERT ... ON DUPLICATE KEY UPDATE.

sfossen
A: 

I ended up using a hybrid of PHP and MySQL (after swinging too far each way at first):

  • Insert a link to the previous day for all previous-day PRs (INSERT ... SELECT)
  • Compare PRs against the previous day and insert a new version if changed (INSERT ... SELECT; a sketch follows this list)
  • Insert links for the newly updated PRs (SELECT, PHP foreach, UPDATE)
  • Add new PRs for each day (INSERT ... SELECT)
  • Insert links for the new PRs (INSERT ... SELECT)
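
A sketch of what the compare-and-insert step (the second bullet) might look like; the joins and the @d variable are my reconstruction for illustration, not the actual query used:

    -- Hypothetical: for day @d, insert a new new_table version only where
    -- the data changed versus the previous day's version of the record.
    INSERT INTO new_table (record_id, data_field)
    SELECT o.record_id, o.data_field
    FROM old_table o
    JOIN new_table_index i
      ON i.date = @d - INTERVAL 1 DAY
    JOIN new_table n
      ON n.index_id = i.index_id
     AND n.record_id = o.record_id
    WHERE o.date = @d
      AND n.data_field <> o.data_field;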

I still need to perfect the step with the PHP foreach loop, but for the most part this did the trick. Thanks for all your help!

Joshua