EDIT: To clarify, the records originally come from a flat-file database and are not in the MySQL database.

One of our existing C programs has the purpose of taking data from the flat file and inserting it (based on criteria) into the MySQL tables:

Open connection to MySQL DB
for record in all_record_of_my_flat_file:
  if record contains a certain field:
    if record is NOT in sql_table A: // see #1
      insert record information into sql_table A and B // see #2
Close connection to MySQL DB
  1. select field from sql_table A where field=XXX
  2. 2 inserts
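
Concretely, each record that passes the field check costs one SELECT plus (when the record turns out to be new) two INSERTs, roughly along these lines (the table and column names here are placeholders, not our real schema):

-- #1: existence check, one round trip per candidate record
SELECT field FROM sql_table_a WHERE field = 'value_from_flat_file';

-- #2: only if the SELECT returned nothing, two more round trips
INSERT INTO sql_table_a (field, other_col) VALUES ('value_from_flat_file', 'other_value');
INSERT INTO sql_table_b (field, other_col) VALUES ('value_from_flat_file', 'other_value');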

I believe that management did not feel it was worth adding the functionality so that when the field in the flat file is created, it would also be inserted into the database. This is specific to one customer (that I know of). I too felt it odd that we use a tool such as this to "sync" the data. I was given the duty of using and maintaining this script, so I haven't heard too much about the entire process. The intent is primarily to handle additional records, so this is not the first time it has been used.

This is typically done every X months to sync everything up, or so I'm told. I've also been told that this process takes roughly a couple of days. There are (currently) at most 2.5 million records (though not necessarily all 2.5M will be inserted, and most likely far fewer). One of the tables contains 10 fields and the other 5 fields. There isn't much to be done about iterating through the records since that part can't be changed at the moment. What I would like to do is speed up the part where I query MySQL.

I'm not sure if I have left out any important details -- please let me know! I'm also no SQL expert so feel free to point out the obvious.

I thought about:

  1. Putting all the inserts into a transaction (at the moment I'm not sure how important it is for the transaction to be all-or-none or if this affects performance)
  2. Using INSERT X WHERE NOT EXISTS Y (a rough sketch follows this list)
  3. LOAD DATA INFILE (but that would require that I create a (possibly) large temp file)
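
For ideas 1 and 2, here is a rough sketch of what I have in mind (table and column names below are placeholders, not our real schema, and as I understand it the transaction only matters if the tables are InnoDB, since MyISAM ignores it):

START TRANSACTION;

-- one statement replaces the per-record SELECT check plus the INSERT into table A
INSERT INTO sql_table_a (field, other_col)
SELECT 'value_from_flat_file', 'other_value'
FROM DUAL
WHERE NOT EXISTS
    (SELECT 1 FROM sql_table_a WHERE field = 'value_from_flat_file');

-- ...same pattern for table B, repeated per record, then one COMMIT for the whole batch
COMMIT;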

I read (hopefully someone can confirm) that I should drop the indexes so they aren't recalculated on every insert.

mysql Ver 14.7 Distrib 4.1.22, for sun-solaris2.10 (sparc) using readline 4.3

+1  A: 

Here are my thoughts on your utility script...

1) Is just good practice anyway; I'd do it no matter what.

2) May save you a considerable amount of execution time. If you can solve a problem in straight SQL without iterating in a C program, that can save a fair amount of time. You'll have to profile it first, in a test environment, to make sure it really does.

3) LOAD DATA INFILE is a tactic to use when inserting a massive amount of data. If you have a lot of records to insert (I'd write a query to do an analysis to figure out how many records you'll have to insert into table B), then it might behoove you to load them this way.

Dropping the indexes before the insert can be helpful to reduce running time, but you'll want to make sure you put them back when you're done.
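
If the tables are MyISAM (a guess on my part), ALTER TABLE ... DISABLE KEYS is a lighter-weight alternative to actually dropping the indexes: it defers maintenance of the non-unique indexes until you switch them back on. A minimal sketch, with a placeholder table name:

ALTER TABLE sql_table_a DISABLE KEYS;   -- stop maintaining non-unique indexes per insert
-- ... run the bulk inserts here ...
ALTER TABLE sql_table_a ENABLE KEYS;    -- rebuild those indexes once, at the end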

Although... why aren't all the records in table B in the first place? You haven't mentioned how processing works, but I would think it would be advantageous to ensure (in your app) that the records got there without your service script's intervention. Of course, you understand your situation better than I do, so ignore this paragraph if it's off-base. I know from experience that there are lots of reasons why utility cleanup scripts need to exist.


EDIT: After reading your revised post, your problem domain has changed: you have a bunch of records in a (searchable?) flat file that you need to load into the database based on certain criteria. I think the trick to doing this as quickly as possible is to determine where the C application is actually the slowest and spends the most time spinning its proverbial wheels:

  • If it's reading off the disk, you're stuck; you can't do anything about that unless you get a faster disk.
  • If it's doing the SQL query-and-insert operations, you could try optimizing those, but you're doing a compare between two databases (the flat file and the MySQL one).

A quick thought: doing a LOAD DATA INFILE bulk insert to populate a temporary table very quickly (perhaps even an in-memory table, if MySQL allows that), and then doing the INSERT-if-not-exists from there, might be faster than what you're currently doing.
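
Something along these lines, as a sketch rather than a tested recipe (the file path, column types, and table names are all placeholders; the MEMORY/HEAP engine does exist if the whole batch fits in RAM):

-- staging table; could also be ENGINE=MEMORY if the batch fits in RAM
CREATE TEMPORARY TABLE staging (
    field     VARCHAR(64),
    other_col VARCHAR(64)
);

LOAD DATA INFILE '/tmp/flatfile_extract.csv'
INTO TABLE staging
FIELDS TERMINATED BY ',';

-- set-based "insert if not exists" from the staging table into both targets
INSERT INTO sql_table_a (field, other_col)
SELECT s.field, s.other_col
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM sql_table_a a WHERE a.field = s.field);

INSERT INTO sql_table_b (field, other_col)
SELECT s.field, s.other_col
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM sql_table_b b WHERE b.field = s.field);

DROP TEMPORARY TABLE staging;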

In short, do profiling, and figure out where the slowdown is. Aside from that, talk with an experienced DBA for tips on how to do this well.

sheepsimulator
There are about 2.5 million records (but not all of them will be inserted). I can use a script to call our tools instead and do some string parsing, but I think it'll be the same (if not slower). I will try to clarify additional points in the question.
nevets1219
+1  A: 

Why not upgrade your MySQL server to 5.0 (or 5.1), and then use a trigger so it's always up to date (no need for the monthly script)?

DELIMITER //
CREATE TRIGGER insert_into_a AFTER INSERT ON source_table
FOR EACH ROW
BEGIN
    IF NEW.foo > 1 THEN
        -- look for an existing row in a; @testvar stays NULL if there is none
        SET @testvar = NULL;
        SELECT id INTO @testvar FROM a WHERE a.id = NEW.id;
        IF @testvar IS NULL THEN
            INSERT INTO a (col1, col2) VALUES (NEW.col1, NEW.col2);
            INSERT INTO b (col1, col2) VALUES (NEW.col1, NEW.col2);
        END IF;
    END IF;
END //
DELIMITER ;

Then, you could even set up update and delete triggers so that the tables are always in sync (if the source table's col1 is updated, it'll automatically propagate to a and b)...
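
A propagating update trigger could look something like this rough, untested sketch (same placeholder columns as above):

DELIMITER //
CREATE TRIGGER update_into_a AFTER UPDATE ON source_table
FOR EACH ROW
BEGIN
    -- keep the copies in a and b in step with the source row
    UPDATE a SET col1 = NEW.col1, col2 = NEW.col2 WHERE a.id = NEW.id;
    UPDATE b SET col1 = NEW.col1, col2 = NEW.col2 WHERE b.id = NEW.id;
END //
DELIMITER ;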

ircmaxell
I apologize for not being clear; the original data that's being processed is NOT in the database (it is a flat file of sorts), so I'm not sure I can use this approach.
nevets1219
Fair enough. Then this would not be possible at all...
ircmaxell
Actually, now that I think about it... Create a temp table, add this trigger, then do a `LOAD DATA INFILE`... All the "overhead" is kept right in the DB, so you save on the network and parsing overheads...
ircmaxell
Upgrading isn't really an option I can choose.
nevets1219
A: 

I discussed this with another colleague and here are some of the improvements we came up with:

For:

SELECT X FROM TABLE_A WHERE Y=Z;

Change to (currently awaiting verification on whether X is, and always will be, unique):

SELECT X FROM TABLE_A WHERE X=Z LIMIT 1;

This was an easy change and we saw some slight improvements. I can't really quantify it well but I did:

SELECT X FROM TABLE_A ORDER BY RAND() LIMIT 1

and compared it against the first two queries. For a few tests there was about a 0.1 second improvement. Perhaps something was cached, but the LIMIT 1 should help somewhat.

Then another (yet to be implemented) improvement(?):

for record number X in entire record range:
  if (no CACHE)
    CACHE = retrieve Y records (sequentially) from the database
  if (X exceeds the highest record number in cache)
    CACHE = retrieve the next set of Y records (sequentially) from the database
  search for record number X in CACHE
  ...etc

I'm not sure what to set Y to; are there any methods for determining a good size to try with? The table has 200k entries. I will edit in some results when I finish the implementation.
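
For reference, the per-batch query I have in mind is something like this (X is the key column from above; the batch size and resume value are placeholders to experiment with):

-- pull the next block of Y rows, resuming after the last key already in the cache
SELECT X
FROM TABLE_A
WHERE X > 'last_key_in_cache'
ORDER BY X
LIMIT 10000;    -- Y = 10000 here, purely a starting guess

For picking Y, the simplest method is probably to benchmark a few sizes (say 1k, 10k, 50k) against the 200k-row table and keep the smallest one beyond which the gain flattens out.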

nevets1219