views:

820

answers:

3

What is the fastest way to load data from flatfiles into a MySQL database, and then create the relations between the tables via foreign keys?

For example... I have a flat file in the format:

[INDIVIDUAL]   [POP]  [MARKER]  [GENOTYPE]

"INDIVIDUAL1", "CEU", "rs55555","AA"  
"INDIVIDUAL1", "CEU", "rs535454","GA"  
"INDIVIDUAL1", "CEU", "rs555566","AT"  
"INDIVIDUAL1", "CEU", "rs12345","TT"  
...  
"INDIVIDUAL2", "JPT", "rs55555","AT"

Which I need to load into four tables:

IND (id,fk_pop,name)  
POP (id,population)  
MARKER (id,rsid)  
GENOTYPE (id,fk_ind,fk_rsid,call)

Specifically, how does one populate the foreign keys in a way that scales? The figures are in the range of 1000+ individuals, each with 1 Million+ genotypes.

+7  A: 

I would take a multiple step approach to do this.

  1. Load the data into a temporary table, that is matching the file format that you have
  2. Write queries to do the other inserts, starting the the general tables, then doing joins to get the FK values.
Mitchel Sellers
A: 

You could to start with the base tables without foreign keys. You would then lookup the IDs as you insert data in the other tables.

Another idea is that you could replace the IDs in the flat file(INDIVIDUAL1,CEU, ...etc.) by GUIDs . and then use them directly as IDs and foreign keys (i noticed this is tagged performance, this may not give the best "performance").

Osama ALASSIRY
+3  A: 

There is a simpler way.

First, make sure you have a UNIQUE constraint on those columns that should have one (name, population, rsid).

Then use something like the following:

 LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE POP FIELDS TERMINATED BY ','
    ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
    (@name, population, @rsid, @call);
 LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE MARKER FIELDS TERMINATED BY ',' 
    ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
    (@name, @population, rsid, @call);
 LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE IND FIELDS TERMINATED BY ',' 
    ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
    (name, @population, @rsid, @call) 
    SET fk_pop = (SELECT id FROM POP WHERE population = @population);
 LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE GENOTYPE FIELDS TERMINATED BY ',' 
    ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
    (@name, @population, @rsid, call)
    SET fk_ind = (SELECT id FROM IND where name = @name),
    fk_rsid = (SELECT id FROM MARKER where rsid = @rsid);

Note where the @ is used to indicate variables, rather than column names. In the first 2 LOAD DATAs, these are just used to ignore data. In the second 2, they are used to look up the foreign keys.

Might not be very fast, mind :).