EDIT: To clarify, the records originally come from a flat-file database and are not in the MySQL database.

One of our existing C programs has the purpose of taking data from the flat file and inserting it (based on criteria) into the MySQL tables:

Open connection to MySQL DB
for record in all_record_of_my_flat_file:
  if record contains a certain field:
    if record is NOT in sql_table A: // see #1
      insert record information into sql_table A and B // see #2
Close connection to MySQL DB
  1. select field from sql_table A where field=XXX
  2. 2 inserts
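
Concretely, each record that passes the field check costs one SELECT plus (when the record turns out to be new) two INSERTs, roughly along these lines (the table and column names here are placeholders, not our real schema):

-- #1: existence check, one round trip per candidate record
SELECT field FROM sql_table_a WHERE field = 'value_from_flat_file';

-- #2: only if the SELECT returned nothing, two more round trips
INSERT INTO sql_table_a (field, other_col) VALUES ('value_from_flat_file', 'other_value');
INSERT INTO sql_table_b (field, other_col) VALUES ('value_from_flat_file', 'other_value');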

I believe that management did not feel it was worth adding the functionality so that when the field in the flat file is created, it would also be inserted into the database. This is specific to one customer (that I know of). I too felt it odd that we use a tool such as this to "sync" the data. I was given the duty of using and maintaining this script, so I haven't heard too much about the entire process. The intent is primarily to handle additional records, so this is not the first time it has been used.

This is typically done every X months to sync everything up, or so I'm told. I've also been told that this process takes roughly a couple of days. There are (currently) at most 2.5 million records (though not necessarily all 2.5M will be inserted, and most likely far fewer). One of the tables contains 10 fields and the other 5 fields. There isn't much to be done about iterating through the records since that part can't be changed at the moment. What I would like to do is speed up the part where I query MySQL.

I'm not sure if I have left out any important details -- please let me know! I'm also no SQL expert so feel free to point out the obvious.

I thought about:

  1. Putting all the inserts into a transaction (at the moment I'm not sure how important it is for the transaction to be all-or-none or if this affects performance)
  2. Using INSERT X WHERE NOT EXISTS Y (a rough sketch follows this list)
  3. LOAD DATA INFILE (but that would require that I create a (possibly) large temp file)
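
For ideas 1 and 2, here is a rough sketch of what I have in mind (table and column names below are placeholders, not our real schema, and as I understand it the transaction only matters if the tables are InnoDB, since MyISAM ignores it):

START TRANSACTION;

-- one statement replaces the per-record SELECT check plus the INSERT into table A
INSERT INTO sql_table_a (field, other_col)
SELECT 'value_from_flat_file', 'other_value'
FROM DUAL
WHERE NOT EXISTS
    (SELECT 1 FROM sql_table_a WHERE field = 'value_from_flat_file');

-- ...same pattern for table B, repeated per record, then one COMMIT for the whole batch
COMMIT;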

I read (hopefully someone can confirm) that I should drop the indexes so they aren't recalculated on every insert.

mysql Ver 14.7 Distrib 4.1.22, for sun-solaris2.10 (sparc) using readline 4.3

+1  A: 

Here are my thoughts on your utility script...

1) Is just good practice anyway; I'd do it no matter what.

2) May save you a considerable amount of execution time. If you can solve a problem in straight SQL without iterating in a C program, that can save a fair amount of time. You'll have to profile it first, in a test environment, to make sure it really does.

3) LOAD DATA INFILE is a tactic to use when inserting a massive amount of data. If you have a lot of records to insert (I'd write a query to do an analysis to figure out how many records you'll have to insert into table B), then it might behoove you to load them this way.

Dropping the indexes before the insert can be helpful to reduce running time, but you'll want to make sure you put them back when you're done.
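
If the tables are MyISAM (a guess on my part), ALTER TABLE ... DISABLE KEYS is a lighter-weight alternative to actually dropping the indexes: it defers maintenance of the non-unique indexes until you switch them back on. A minimal sketch, with a placeholder table name:

ALTER TABLE sql_table_a DISABLE KEYS;   -- stop maintaining non-unique indexes per insert
-- ... run the bulk inserts here ...
ALTER TABLE sql_table_a ENABLE KEYS;    -- rebuild those indexes once, at the end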

Although... why aren't all the records in table B in the first place? You haven't mentioned how processing works, but I would think it would be advantageous to ensure (in your app) that the records got there without your service script's intervention. Of course, you understand your situation better than I do, so ignore this paragraph if it's off-base. I know from experience that there are lots of reasons why utility cleanup scripts need to exist.


EDIT: After reading your revised post, your problem domain has changed: you have a bunch of records in a (searchable?) flat file that you need to load into the database based on certain criteria. I think the trick to doing this as quickly as possible is to determine where the C application is actually the slowest and spends the most time spinning its proverbial wheels:

  • If it's reading off the disk, you're stuck; you can't do anything about that unless you get a faster disk.
  • If it's doing the SQL query-and-insert operations, you could try optimizing those, but you're doing a compare between two databases (the flat file and the MySQL one).

A quick thought: doing a LOAD DATA INFILE bulk insert to populate a temporary table very quickly (perhaps even an in-memory table, if MySQL allows that), and then doing the INSERT-if-not-exists from there, might be faster than what you're currently doing.
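
Something along these lines, as a sketch rather than a tested recipe (the file path, column types, and table names are all placeholders; the MEMORY/HEAP engine does exist if the whole batch fits in RAM):

-- staging table; could also be ENGINE=MEMORY if the batch fits in RAM
CREATE TEMPORARY TABLE staging (
    field     VARCHAR(64),
    other_col VARCHAR(64)
);

LOAD DATA INFILE '/tmp/flatfile_extract.csv'
INTO TABLE staging
FIELDS TERMINATED BY ',';

-- set-based "insert if not exists" from the staging table into both targets
INSERT INTO sql_table_a (field, other_col)
SELECT s.field, s.other_col
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM sql_table_a a WHERE a.field = s.field);

INSERT INTO sql_table_b (field, other_col)
SELECT s.field, s.other_col
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM sql_table_b b WHERE b.field = s.field);

DROP TEMPORARY TABLE staging;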

In short, do profiling, and figure out where the slowdown is. Aside from that, talk with an experienced DBA for tips on how to do this well.

sheepsimulator
There are about 2.5 million records (but not all of them will be inserted). I can use a script to call our tools instead and do some string parsing, but I think it'll be the same (if not slower). I will try to clarify additional points in the question.
nevets1219
+1  A: 

Why not upgrade your MySQL server to 5.0 (or 5.1), and then use a trigger so it's always up to date (no need for the monthly script)?

DELIMITER //
CREATE TRIGGER insert_into_a AFTER INSERT ON source_table
FOR EACH ROW
BEGIN
    IF NEW.foo > 1 THEN
        -- look for an existing row in a; @testvar stays NULL if there is none
        SET @testvar = NULL;
        SELECT id INTO @testvar FROM a WHERE a.id = NEW.id;
        IF @testvar IS NULL THEN
            INSERT INTO a (col1, col2) VALUES (NEW.col1, NEW.col2);
            INSERT INTO b (col1, col2) VALUES (NEW.col1, NEW.col2);
        END IF;
    END IF;
END //
DELIMITER ;

Then, you could even set up update and delete triggers so that the tables are always in sync (if the source table's col1 is updated, it'll automatically propagate to a and b)...
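
A propagating update trigger could look something like this rough, untested sketch (same placeholder columns as above):

DELIMITER //
CREATE TRIGGER update_into_a AFTER UPDATE ON source_table
FOR EACH ROW
BEGIN
    -- keep the copies in a and b in step with the source row
    UPDATE a SET col1 = NEW.col1, col2 = NEW.col2 WHERE a.id = NEW.id;
    UPDATE b SET col1 = NEW.col1, col2 = NEW.col2 WHERE b.id = NEW.id;
END //
DELIMITER ;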

ircmaxell
I apologize for not being clear; the original data that's being processed is NOT in the database (it is a flat file of sorts), so I'm not sure I can use this approach.
nevets1219
Fair enough. Then this would not be possible at all...
ircmaxell
Actually, now that I think about it... Create a temp table, add this trigger, then do a `LOAD DATA INFILE`... All the "overhead" is kept right in the DB, so you save on the network and parsing overheads...
ircmaxell
Upgrading isn't really an option I can choose.
nevets1219
A: 

I discussed this with another colleague and here are some of the improvements we came up with:

For:

SELECT X FROM TABLE_A WHERE Y=Z;

Change to (currently awaiting verification on whether X is, and always will be, unique):

SELECT X FROM TABLE_A WHERE X=Z LIMIT 1;

This was an easy change and we saw some slight improvements. I can't really quantify it well but I did:

SELECT X FROM TABLE_A ORDER BY RAND() LIMIT 1

and compared it against the first two queries. For a few tests there was about a 0.1 second improvement. Perhaps something was cached, but the LIMIT 1 should help somewhat.

Then another (yet to be implemented) improvement(?):

for record number X in entire record range:
  if (no CACHE)
    CACHE = retrieve Y records (sequentially) from the database
  if (X exceeds the highest record number in cache)
    CACHE = retrieve the next set of Y records (sequentially) from the database
  search for record number X in CACHE
  ...etc

I'm not sure what to set Y to; are there any methods for determining a good size to try with? The table has 200k entries. I will edit in some results when I finish the implementation.
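
For reference, the per-batch query I have in mind is something like this (X is the key column from above; the batch size and resume value are placeholders to experiment with):

-- pull the next block of Y rows, resuming after the last key already in the cache
SELECT X
FROM TABLE_A
WHERE X > 'last_key_in_cache'
ORDER BY X
LIMIT 10000;    -- Y = 10000 here, purely a starting guess

For picking Y, the simplest method is probably to benchmark a few sizes (say 1k, 10k, 50k) against the 200k-row table and keep the smallest one beyond which the gain flattens out.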

nevets1219