Hi all.

For work I'm dealing with a large database (160 million+ rows a year, 10 years of data) and have a quandary: a large percentage of the data we upload is null data, and I'd like to stop it from being uploaded.

The data in question is spatial in nature, so I have one table like so:

idLocations (Auto-increment int, PK)
X (float)
Y (float)
Alwaysignore (Bool)

Which is used as a reference in a second table like so:

idLocations (Int, PK, "FK")
idDates (Int, PK, "FK")
DATA1 (float)
DATA2 (float)
...
DATA7 (float)
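
In CREATE TABLE terms, the two tables look roughly like this (the exact types, defaults, and key details beyond what's listed above are approximate):

-- Rough sketch only: names/types beyond the description above are assumptions, MyISAM per the details at the end.
CREATE TABLE tblLocation (
    idLocations  INT NOT NULL AUTO_INCREMENT,
    X            FLOAT,
    Y            FLOAT,
    Alwaysignore BOOL NOT NULL DEFAULT 0,
    PRIMARY KEY (idLocations)
) ENGINE=MyISAM;

CREATE TABLE tblData (
    idLocations INT NOT NULL,  -- "FK" to tblLocation (not enforced by MyISAM)
    idDates     INT NOT NULL,  -- "FK" to a dates table (not enforced by MyISAM)
    DATA1       FLOAT,
    DATA2       FLOAT,
    -- ...
    DATA7       FLOAT,
    PRIMARY KEY (idLocations, idDates)
) ENGINE=MyISAM;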

So, ideally I'd like to find a method where I can do something like:

INSERT INTO tblData (idLocations, idDates, DATA1, ..., DATA7) VALUES (...), ..., (...)
WHERE VALUES(idLocations) NOT IN (SELECT idLocations FROM tblLocation WHERE Alwaysignore = TRUE)
ON DUPLICATE KEY UPDATE DATA1 = VALUES(DATA1), ...

That is, for my large batches of input data (250 values in a block), I want to skip the inserts whose idLocations matches a location flagged with Alwaysignore.

Anyone have any suggestions?

Cheers. -Stuart

Other details: Running MySQL on a semi-dedicated machine, MyISAM engine for the tables.

+1  A: 
INSERT
INTO    tbldata
SELECT  *
FROM    (
        SELECT  :id1 AS id, :data1 AS data1
        UNION ALL
        SELECT  :id2 AS id, :data2 AS data1
        UNION ALL
        …
        ) q
WHERE   q.id NOT IN
        (
        SELECT  id
        FROM    tbllocation
        WHERE   alwaysignore
        )
ON DUPLICATE KEY UPDATE
        data1 = VALUES(data1),
        …

Substitute :id1, :data1, etc. in the subquery with your values.
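
Expanded with the table and column names from the question, the statement would look roughly like the following; the id, date, and data literals here are placeholders for illustration only:

INSERT
INTO    tblData (idLocations, idDates, DATA1, DATA2, DATA3, DATA4, DATA5, DATA6, DATA7)
SELECT  *
FROM    (
        SELECT  101 AS idLocations, 20100101 AS idDates,
                1.5 AS DATA1, 2.5 AS DATA2, 3.5 AS DATA3, 4.5 AS DATA4,
                5.5 AS DATA5, 6.5 AS DATA6, 7.5 AS DATA7
        UNION ALL
        SELECT  102, 20100101, 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1
        ) q
WHERE   q.idLocations NOT IN
        (
        SELECT  idLocations
        FROM    tblLocation
        WHERE   Alwaysignore
        )
ON DUPLICATE KEY UPDATE
        DATA1 = VALUES(DATA1),
        DATA2 = VALUES(DATA2),
        DATA3 = VALUES(DATA3),
        DATA4 = VALUES(DATA4),
        DATA5 = VALUES(DATA5),
        DATA6 = VALUES(DATA6),
        DATA7 = VALUES(DATA7)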

Quassnoi
A: 

If the list of locations you want to ignore isn't huge, you could always cache those in your client and avoid hitting MySQL at all for those cases. So when the program starts up you do:

SELECT `x`, `y` FROM `locations` WHERE `alwaysignore` = 1;

and cache the x/y pairs, maybe even as a string like "x_y". Then, when it comes to inserting, you simply check whether that value exists in the cache, and if so, don't do anything in MySQL.

Of course, if this list is enormous you probably don't want to do that, because it will use up quite a bit of memory.

oedo
Ended up using this solution. It was easier to implement and, given the size of the data I'm dealing with, provides a reasonable performance compromise.
Stuart P