We have two servers, one of which belongs to our customer. The customer gives us URLs of XML/JSON exports of his clients' information from his CMS, and our task is to write import scripts that load this data into the web app we're developing.

I've always done it like this:

INSERT INTO customers (name,address) VALUES ('John Doe', 'NY') ON DUPLICATE KEY UPDATE name='John Doe', address='NY'

As far as I know, this is the best solution performance-wise...

But this does NOT solve the problem of deleted records. What if a client is deleted from the customer's database and no longer appears in the export - how should I handle that?

Should I first TRUNCATE the whole table and then refill it? Or should I load all records into a PHP array, walk through it, and delete the records that aren't in the XML/JSON? (Rough sketch of that second idea below.)
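
For illustration, something along these lines, using a staging table instead of the PHP array (the staging table and the id column are made up just to show the idea, assuming the export carries each client's primary key):

CREATE TEMPORARY TABLE customers_import (id INT UNSIGNED PRIMARY KEY);

-- for each record in the export: upsert it and remember its id
INSERT INTO customers (id, name, address) VALUES (1, 'John Doe', 'NY')
    ON DUPLICATE KEY UPDATE name = 'John Doe', address = 'NY';
INSERT INTO customers_import (id) VALUES (1);

-- afterwards, delete everything that was not present in the export
DELETE FROM customers WHERE id NOT IN (SELECT id FROM customers_import);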

I think there must be a better solution.

I'm interested in the best solution performance-wise, because we have to import many thousands of records and the whole import can take a long time.

A: 

Assuming you are using MySQL, the only SQL I know anything about:

Is it true that the export of your customer's CMS always contains all of his current customer data? If so, then yes, IMO it is best to drop or truncate the 'customers' table; that is, to just throw away yesterday's customer table and reconstruct it today from the beginning.

But you can't use 'insert': it will take ~28 hours per day to insert thousands of customer rows. So forget about 'insert'.

Instead, add rows into 'customers' with 'load data local infile': first write a temp disk file 'cust_data.txt' of all the customer data, with column data separated somehow (perhaps by commas), and then say something like:

load data local infile 'cust_data.txt' replace into table customers fields terminated by ',' lines terminated by '\n';
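
If the brief moment where the table is empty during the reload is a concern, one possible variant (just a sketch; the customers_new/customers_old names are made up) is to load into a fresh copy of the table and then swap it in atomically:

create table customers_new like customers;

load data local infile 'cust_data.txt' into table customers_new fields terminated by ',' lines terminated by '\n';

-- rename table swaps both names in one atomic step, so readers never see an empty table
rename table customers to customers_old, customers_new to customers;
drop table customers_old;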

Can you structure the query such that you can use your client's output file directly, without first staging it into 'cust_data.txt'? That would be the answer to a maiden's prayer.

It should be fast enough for you: you will be amazed!

ref: http://dev.mysql.com/doc/refman/5.0/en/load-data.html

Pete Wilson
But for a few minutes right after the TRUNCATE, our webapp would be serving incomplete data... :( And I don't believe that replacing the normal SQL queries with the import query can save that much time... Or maybe that step itself can, but we have to call many URLs and fetch XML/JSON from them, and that is what really takes the time. My priority is also the integrity and completeness of the data.
Radek Šimko
Truncating the table each time means that you risk losing data - not what you want to do if it's intended for backup purposes.
symcbean
Just asking, not arguing: how/where does truncating the table each time risk data loss? I have to believe that the "current" state of customer info is always around somewhere, intact and complete, at Radek's client; and is therefore available to reconstruct the 'customer' table any time. What am I missing? Thanks!
Pete Wilson
+1  A: 

I'm interested in the best solution performance-wise

If it's MySQL at the client, use MySQL replication - the client as the master and your end as the slave. You can either use a direct feed (you'd probably want to run this across a VPN) or run in disconnected mode (they send you the bin logs to roll forward).
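
For reference, a minimal sketch of the slave side (host, credentials and binlog coordinates are placeholders you'd get from the customer):

-- run on your (slave) server; all values below are placeholders
CHANGE MASTER TO
    MASTER_HOST = 'cms.customer.example',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;

-- check that both replication threads are running
SHOW SLAVE STATUS;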

The customer gives us URLs of XML/JSON exports of his clients' information from his CMS

This is a really dumb idea - and sounds like you're trying to make the solution fit the problem (which it doesn't). HTTP is not the medium for transferring large data files across the internet. It also means that the remote server must do rather a lot of work just to make the data available (assuming it can even identify what data needs to be replicated - and as you point out, that is currently failing to work for deleted records). The latter point is true regardless of the network protocol.

You certainly can't copy large amounts of data directly across at a lower level in the stack than the database (e.g. trying to use rsync to replicate data files), because the local mirror will nearly always be inconsistent.

C.

symcbean
A: 

If your customer can export the data as a CSV file, you can use SQL Data Examiner http://www.sqlaccessories.com/SQL_Data_Examiner to update records in the target database (insert/update/delete) using the CSV file as the source.

SQLDev