views: 85
answers: 4

I have run into a slight problem. The story goes as follows:

I have a document archive system (written in PHP) which runs at multiple client sites (23 at present). Each client's system holds only their own documents. Every night, they all need to be synced to a master database on our site (the central server). I have access to each client's MySQL database from the central server, so connecting to them is no problem.

I have a script that connects to the client database and selects all the entries from a table where the sync column = '0000-00-00 00:00:00' (the default, indicating the record hasn't been synced). It then iterates through each record, inserts it into the central server, and sets the sync time on the client record to the time the script was executed. This works, but it obviously has a large overhead from all the individual queries, and I have only just noticed the problem.

Each client can generate 2,000 to 3,000-odd documents a day. With numbers like that, the sync is taking far too long (roughly 1 second per 2 documents).

Is there a better solution to my problem? Preferably a PHP-scripted solution, as I need to log whether everything was successful.

Thanks

EDIT: My current process is:

  1. Select all the un-synced data
  2. Begin transaction
  3. Insert record into central database server
  4. Select the document record from the client
  5. Insert the document into the central database server
  6. Update sync column on client
  7. Update sync column on server
  8. Commit transaction

This is a script run on the central server. Now that I come to think of it, I can remove step 7 and have it be part of step 5, but that won't reduce the processing time by much.
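
In PHP terms, the loop is roughly the sketch below (the DSNs and column names are placeholders, not my exact code; only the table names and the sequence of steps are real):

    <?php
    // Sketch of the per-record sync loop described above.
    $client  = new PDO('mysql:host=client1.example;dbname=docarch', 'user', 'pass');
    $central = new PDO('mysql:host=localhost;dbname=docarch_master', 'user', 'pass');
    $client->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $central->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $now = date('Y-m-d H:i:s');

    // 1. Select all the un-synced records on the client
    $unsynced = $client->query(
        "SELECT * FROM docarch_printout WHERE sync = '0000-00-00 00:00:00'"
    );

    foreach ($unsynced as $row) {
        // 2. Begin transaction on the central server
        $central->beginTransaction();

        // 3. (+ 7.) Insert the detail record centrally, already marked as synced
        $central->prepare(
            "INSERT INTO docarch_printout (client_id, ref, created, sync)
             VALUES (?, ?, ?, ?)"
        )->execute([$row['client_id'], $row['ref'], $row['created'], $now]);
        $newId = $central->lastInsertId();

        // 4. Select the matching document record from the client
        $doc = $client->prepare("SELECT * FROM docarch_printout_docs WHERE id = ?");
        $doc->execute([$row['id']]);
        $docRow = $doc->fetch(PDO::FETCH_ASSOC);

        // 5. Insert the document into the central database
        $central->prepare(
            "INSERT INTO docarch_printout_docs (printout_id, body) VALUES (?, ?)"
        )->execute([$newId, $docRow['body']]);

        // 6. Flag the record as synced on the client
        $client->prepare("UPDATE docarch_printout SET sync = ? WHERE id = ?")
               ->execute([$now, $row['id']]);

        // 8. Commit
        $central->commit();
    }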

+1  A: 

I'd suggest using auto_increment_increment to keep all the ids unique over all of the servers. Then, all you need to do is a SELECT * FROM blah WHERE sync = '0000-00-00 00:00:00', and then generate the insert statements and execute them. You won't have to deal with any kind of conflict resolution for conflicting primary keys...
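
In practice that means setting auto_increment_increment together with auto_increment_offset on each server. For example, with 23 clients plus the master (the values below are purely illustrative):

    -- In each server's my.cnf, or at runtime as shown. With 24 servers in total,
    -- client #1 generates ids 1, 25, 49, ..., client #2 generates 2, 26, 50, ...
    SET GLOBAL auto_increment_increment = 24;
    SET GLOBAL auto_increment_offset    = 1;   -- 2 on client #2, 3 on client #3, ...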

As for the long query times, you need to look at the size of your data. If each record is sizable (a few hundred kb +), it's going to take time...

One option may be to create a federated table for each child server's table. Then do the whole thing in SQL on the master. INSERT INTO master_table SELECT * FROM child_1_table WHERE sync = '0000-00-00 00:00:00'... You get to avoid pulling all of the data into PHP. You can still run some checks to make sure everything went well, and you can still log since everything is still executed from PHP land...
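
Something along these lines (the connection string and column list are made up; a FEDERATED table has to mirror the remote table's definition exactly):

    -- A FEDERATED table on the master that points at one child's table.
    CREATE TABLE child_1_table (
        id      INT NOT NULL,
        ref     VARCHAR(64),
        created DATETIME,
        sync    DATETIME,
        PRIMARY KEY (id)
    ) ENGINE=FEDERATED
      CONNECTION='mysql://sync_user:secret@child1.example:3306/docarch/docarch_printout';

    -- The transfer then becomes a single set-based statement, still issued from PHP:
    INSERT INTO master_table
    SELECT * FROM child_1_table
    WHERE sync = '0000-00-00 00:00:00';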

ircmaxell
The IDs are not a problem; those do not need to be synced. I get the correct record via various other columns. I just use the ID to link the docarch_printout table (all details regarding the document) to the docarch_printout_docs table (1-1, contains just the document). The other problem is that we do not have a permanent connection to the clients; some are on dial-on-demand ISDN lines. Because of that I don't think the federated table will work. Nice idea though, I never knew MySQL had that option.
Surim
Well, I suppose you could store the CREATE statement for the federated table in your program. Then, when you connect to the client, run the create script and drop the table once you're done (so it only uses the connection while you're actively syncing)...
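A rough sketch of that create-on-connect / drop-when-done flow from PHP (DSN, credentials and the table definition are placeholders):

    <?php
    // Create the federated table only while the link to the client is up,
    // run the set-based copy, then drop it again.
    $master = new PDO('mysql:host=localhost;dbname=docarch_master', 'user', 'pass');
    $master->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $master->exec("
        CREATE TABLE child_1_table (
            id      INT NOT NULL,
            ref     VARCHAR(64),
            created DATETIME,
            sync    DATETIME,
            PRIMARY KEY (id)
        ) ENGINE=FEDERATED
          CONNECTION='mysql://sync_user:secret@child1.example:3306/docarch/docarch_printout'");

    try {
        $copied = $master->exec(
            "INSERT INTO master_table
             SELECT * FROM child_1_table WHERE sync = '0000-00-00 00:00:00'"
        );
        error_log(date('c') . " synced $copied rows from child 1");  // audit log
    } finally {
        // Drop the federated table so the dial-on-demand link is only used
        // while a sync is actually running.
        $master->exec("DROP TABLE IF EXISTS child_1_table");
    }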
ircmaxell
True. I really like that idea. Everything seems great, but I need to sync two tables (a 1-1 relationship) which are referenced by the id field, one being the details, the other being the actual document. It would be easy peasy if it was all just a single table. Any further thoughts on that? Thanks.
Surim
Create two federated tables, and just adjust your insert statement to only move the relevant columns... It shouldn't be too hard (especially since you can join the federated tables onto the local tables to determine the unique identifier if you need to)...
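For the 1-1 pair it might look something like this, with federated copies of both client tables and a join back onto the local details table to pick up the new master-side id (every name here is a guess at the real schema):

    -- Copy the un-synced detail rows, stamping the sync time on the master side.
    INSERT INTO docarch_printout (client_id, ref, created, sync)
    SELECT client_id, ref, created, NOW()
    FROM fed_printout
    WHERE sync = '0000-00-00 00:00:00';

    -- Copy the matching documents. The two federated tables join on the client's
    -- id; the join onto the local table resolves the new master-side id via the
    -- columns that uniquely identify a record.
    INSERT INTO docarch_printout_docs (printout_id, body)
    SELECT m.id, d.body
    FROM fed_printout p
    JOIN fed_printout_docs d ON d.id = p.id
    JOIN docarch_printout m  ON m.client_id = p.client_id AND m.ref = p.ref
    WHERE p.sync = '0000-00-00 00:00:00';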
ircmaxell
I spent last night rewriting the database structure and managed to get the federated tables working. Thanks. It seems to be loads faster, probably a combination of the new structure and the better sync process. Thank you.
Surim
A: 

The basic method sounds OK - but taking 0.5 seconds to do one operation is ridiculously excessive - how much data are you pulling across the network? The entire image? Are you doing anything else in the operation? Is there an index on the sync column?

You could get a small benefit by doing an export of the un-synced data on the database:

1) mark all records available for sync with a transaction id in a new column
2) extract all records flagged in first step into a flat file
3) copy the file across the network
4) load the data into the master DB
5) if successful notify the origin server
6) origin server then sets the sync time for all records flagged with that transaction id

This would require 3 scripts: two on the origin server (one to prepare and send the data, one to flag it as complete) and one on the replicated server to pick up the data and notify the outcome.
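
The origin-side half might be sketched like this (the sync_txn column, the file paths and the copy command are just placeholders):

    <?php
    // Steps 1-3 on the origin (client) server.
    $db = new PDO('mysql:host=localhost;dbname=docarch', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $txnId = uniqid('sync_', true);

    // 1) Flag every record available for sync with this transaction id
    $db->prepare("UPDATE docarch_printout
                  SET sync_txn = ?
                  WHERE sync = '0000-00-00 00:00:00'")
       ->execute([$txnId]);

    // 2) Extract the flagged records into a flat file (written by the MySQL server)
    $db->prepare("SELECT * INTO OUTFILE '/tmp/{$txnId}.csv'
                  FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
                  FROM docarch_printout
                  WHERE sync_txn = ?")
       ->execute([$txnId]);

    // 3) Compress and copy the file across the link in one go
    shell_exec(sprintf(
        'gzip -c %s | ssh master.example "cat > /incoming/%s.csv.gz"',
        escapeshellarg("/tmp/{$txnId}.csv"),
        $txnId
    ));

    // Steps 4-6 happen on the master: LOAD DATA INFILE into the master table,
    // then a call back to this server, which sets the sync time on every record
    // carrying $txnId.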

But this is probably not going to make big inroads into the processing time, which seems absurdly high if you are only replicating metadata about the image (rather than the image itself).

C.

symcbean
Updated my initial post with a step-by-step run-through of the script so you can see my logic.
Surim
This still doesn't explain why it's taking 0.5 seconds per record. Since a transaction cannot span two independent DBMSs, it's not adding any value here. How big is the record?
symcbean
It would be due to the fact that the clients are connected to us by ADSL (a few via ISDN). The documents are only a few kilobytes of text.
Surim
hmmm, if it really is the bandwidth that's the problem then the solution is probably to add more bandwidth - you're still being very vague about the network traffic - certainly if updates of each record are coordinated across a slow link then latency may be the problem. If you try my method (single file) then you would eliminate the latency problem and could compress the file before transmitting it.
symcbean
A: 

I know you prefer a PHP-based solution, but you might want to check out the Microsoft Sync Framework -

http://msdn.microsoft.com/en-in/sync/default(en-us).aspx

This would require the sync module to be written in .NET, but there is a huge advantage in terms of sync logic and exception handling (network failures, sync conflicts, etc.), which will save you time.

The framework handles non-SQL Server databases as well, as long as there is a database connector for .NET. MySQL should be supported quite easily - just take a sample from the following link -

http://code.msdn.microsoft.com/sync/Release/ProjectReleases.aspx?ReleaseId=4835

and adapt it to MySQL.

Roopesh Shenoy
That would be great, but it has to run on CentOS. We don't have any Microsoft servers.
Surim
Hmm, that's a problem alright, though it would be cheap to actually get one running just for this purpose! We saved a lot of dev effort with this, so you could do some cost/benefit analysis and decide.
Roopesh Shenoy
A: 

There's another possibility if you can't use the Sync Framework:

Is it possible for you to distribute the load throughout the day instead of doing it all at the end of the day? Say, trigger a synchronization every time 10 new documents come in or 10 edits are done? (This can be done if the synchronization is initiated from the client side.)
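
For the client-initiated variant, even something as simple as this, run after each import batch, would do it (the threshold, DSN and sync command are illustrative):

    <?php
    // After importing new documents on the client, check how many are still
    // un-synced and kick off the existing sync once a threshold is reached.
    $db = new PDO('mysql:host=localhost;dbname=docarch', 'user', 'pass');

    $pending = (int) $db->query(
        "SELECT COUNT(*) FROM docarch_printout WHERE sync = '0000-00-00 00:00:00'"
    )->fetchColumn();

    if ($pending >= 10) {
        // Placeholder for invoking the existing sync script.
        shell_exec('php /path/to/sync.php >> /var/log/docarch_sync.log 2>&1');
    }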

If you want to move the sync logic to the server side, you can consider using message queues to send notifications from the clients to the server whenever a client needs to synchronize; the server can then pull the data. You can use an in-house service bus or on-demand platforms like Azure AppFabric or Amazon SQS for this.

Roopesh Shenoy
The documents that get archived are generated by another application at its day end. The document archive system monitors the directory for new files, processes them, etc. Because of this, the imports are done in batches as well. The sync needs to be done at the end of the day to tie in with when the clients on ISDN lines connect to us for other purposes.
Surim
Okay.. since you've already got an answer that works.. cheers!
Roopesh Shenoy