views: 879
answers: 11

I have a PHP client that requests an XML file over HTTP (i.e. loads an XML file via URL). As of now, the XML file is only several KB in size. A problem I can foresee is that the XML becomes several MBs or GBs in size. I know that this is a huge question and that there are probably a myriad of solutions, but what ideas do you have to transport this data to the client?

Thanks!

A: 

Gallery2, which allows you to upload photos over HTTP, makes you set a couple of PHP parameters, post_max_size and upload_max_filesize, to allow larger uploads. You might want to look into that.
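For reference, those two directives can only be raised in php.ini or per-directory configuration (e.g. php_value in .htaccess), not with ini_set() at runtime; a minimal sketch for checking what is currently in effect (nothing here is Gallery2-specific):

```php
<?php
// Print the directives that cap HTTP uploads and POST bodies.
// To raise them, edit php.ini or .htaccess, e.g.:
//   post_max_size = 64M
//   upload_max_filesize = 64M
echo 'post_max_size:       ' . ini_get('post_max_size') . PHP_EOL;
echo 'upload_max_filesize: ' . ini_get('upload_max_filesize') . PHP_EOL;
echo 'max_execution_time:  ' . ini_get('max_execution_time') . PHP_EOL;
```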

It seems to me that posting large files has problems with browser time-outs and the like, but on the plus side it works with proxy servers and firewalls better than trying a different file upload protocol.

Paul Tomblin
Browsers should not have an issue; I've used HTTP to upload a 3GB file to one of my apps in the past. It took most of the day, but it got there in the end.
Ady
+3  A: 

Ignoring how well a browser may or may not handle a GB-sized XML file, the only real concern I can think of off the top of my head is whether the time needed to generate all the XML exceeds any execution-time thresholds set in your environment.

Peter Bailey
A: 

Thanks for the responses. I failed to mention that transferring the file should be relatively fast (a few minutes max, is this even possible?). The XML that is requested will be parsed and inserted into a database every night. The XML may be the same as the night before, or it may be different. So there are basically two requirements: 1. it has to be relatively fast, and 2. it should minimize the number of writes to the database.

One solution that was proposed is to zip the XML file and then transfer it, but that only satisfies (1).

Any other ideas?

CoolGravatar
First you said that the XML could grow to GBs, and now you're saying "a few minutes max". You can't do both.
Paul Tomblin
+3  A: 

Based on your use case I'd definitely suggest zipping up the data first. In addition, you may want to MD5-hash the file and compare it before initiating the download (no need to update if the file has no changes); this will help with point #2.
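A minimal sketch of that idea on the consuming side, assuming the server publishes a gzipped export plus an MD5 checksum file next to it (the URLs and paths below are made up):

```php
<?php
// Hypothetical URLs/paths -- adjust to your setup.
// file_get_contents()/copy() on URLs requires allow_url_fopen (or swap in cURL).
$remoteHash = trim(file_get_contents('http://example.com/export/data.xml.gz.md5'));
$localFile  = '/var/data/data.xml.gz';

// Skip the download entirely if we already have an identical copy (helps with point #2).
if (!file_exists($localFile) || md5_file($localFile) !== $remoteHash) {
    copy('http://example.com/export/data.xml.gz', $localFile);
}

// For modest sizes, gzfile() decompresses in memory; for GB-scale files
// you would stream with gzopen()/gzread() instead.
$xml = implode('', gzfile($localFile));
```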

Also, would it be possible to just send the segment of XML that has changed, instead of the whole file?

Owen
Sending segments would be a good idea and it's definitely feasible.
CoolGravatar
Great. I'd suggest hashing the whole file on both ends and comparing the hashes before initiating a transfer. If there are updates, just send the changed segment (gzipped, as mentioned) and then piece it together at the "client". If you're not tied to XML, perhaps a lighter-weight format (JSON?) may be better.
Owen
+2  A: 

Given that the XML is created dynamically by your PHP, the simplest thing I can think of is to ensure that the file is gzipped automatically by the webserver, as described here; that page offers both a general PHP approach and an Apache httpd-specific solution.
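One common PHP-side way to do this is to wrap the script's output in the gzip output buffer; a minimal sketch (the Apache-side alternative is mod_deflate/mod_gzip, which needs no script changes):

```php
<?php
// Compress the response with gzip when the client sent Accept-Encoding: gzip.
// ob_gzhandler falls back to plain output for clients that don't support it.
ob_start('ob_gzhandler');

header('Content-Type: text/xml');
echo '<?xml version="1.0"?>';
echo '<export>';
// ...stream the rows out here...
echo '</export>';

ob_end_flush();
```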

Besides that, having a browser (what else can be a PHP client?) do such a job every night for data synchronizing suggests there must be a far simpler solution somewhere else.

And, of course, at some point, transferring "a lot" of data is going to take "a lot" of time...

lImbus
"what else can be a PHP-client?" Another server - like a soapclient =P
Peter Bailey
Running locally, PHP would be able to download files via the CLI, which could be added as a cron job.
nickf
A: 

Are there any algorithms that I could apply to compress the XML? How are large files such as MP3s being downloaded in a matter of seconds?

CoolGravatar
Well, that's a different matter altogether, depending on your server's upload speed and the user's (or your other machine's) download speed, and of course all the random tubes in between :)
Owen
MP3s are not several GBs large
Shinhan
A: 

Having PHP receive GBs of data will take a long time and adds overhead. It is also more susceptible to failures.

I would dispatch the job to a shell script (wget with simple error handling) that is not bothered by execution-time limits and, on failure, could perhaps even retry on its own.

I'm not very experienced with this, but though one could use exec() or the like, these calls unfortunately block until the command finishes.

Calling a script with **./test.sh &** makes it run in the background and solves that problem, I guess. The script could easily let your PHP pick the job back up via a wget to `http://yoursite.com/continue-xml-stuff.php?id=1049381023&status=0`. The id could be a filename, if you don't need to backtrack lost requests. The status would indicate how the script ended up handling the request.

Or just run it from cron.
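A minimal sketch of that hand-off from PHP, assuming a hypothetical fetch-xml.sh wrapper around wget that does the retrying and finally calls back into continue-xml-stuff.php:

```php
<?php
// Launch the download job in the background so this request returns immediately.
// fetch-xml.sh is a hypothetical wget wrapper; redirecting output and appending
// '&' is what keeps exec() from blocking until the script finishes.
$id = 1049381023; // e.g. a filename or job id
exec(sprintf('./fetch-xml.sh %d > /dev/null 2>&1 &', $id));
```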
A: 

Have you thought about using some sort of version control system to handle this? You could leverage its ability to calculate and send just the differences in the files, plus you get the added benefits of maintaining a version history of your file.
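If a full version control system feels heavy, the same "send only the differences" idea can be sketched with the standard diff/patch utilities driven from PHP (a rough sketch with made-up paths, not necessarily what a VCS-based setup would look like):

```php
<?php
// Made-up paths. diff(1) produces a unified patch of what changed since last night;
// the receiving side applies it with patch(1) to keep its copy in sync.
exec('diff -u /data/export-yesterday.xml /data/export-today.xml > /data/export.patch');
exec('gzip -f /data/export.patch');
// ...transfer /data/export.patch.gz instead of the full multi-GB file...
```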

nickf
A: 

Since I don't know the details of your situation, I'll throw a question out there. Just for the sake of argument, does it have to be HTTP? FTP is much better suited for large data transfer and can be automated easily via PHP or Perl.
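For what it's worth, PHP's built-in FTP functions make such a nightly pull fairly painless; a minimal sketch with placeholder host, credentials, and paths:

```php
<?php
// Placeholder connection details -- replace with real values.
$conn = ftp_connect('ftp.example.com');
ftp_login($conn, 'user', 'password');
ftp_pasv($conn, true); // passive mode tends to cooperate better with firewalls

// Download the (compressed) export in binary mode.
ftp_get($conn, '/tmp/data.xml.gz', 'exports/data.xml.gz', FTP_BINARY);
ftp_close($conn);
```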

Chris Kloberdanz
It doesn't have to be HTTP. That was the original plan, but I'm free to use whatever protocol... FTP might work. However, I'm currently experimenting with compressing the XML and then sending it over HTTP.
CoolGravatar
+1  A: 

The real problem is that he's syncing up two datasets; the problem as originally stated misses this entirely.

You need to either a) keep a differential log of changes to dataset A so that you can send that log to dataset B, or b) keep two copies of the dataset (last night's and the current one) and compare them so that you can then send the differential log from A to B.

Welcome to the world of replication.

The problem with (a) is that it's potentially invasive to all of your code, though if you're using an RDBMS you could perhaps do the logging via database triggers that keep track of inserts/updates/deletes, write the information into a table, and then export the relevant rows as your differential log. But that can get nasty too.
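As a rough illustration of (a), assuming triggers fill a hypothetical changelog table with the ids of touched rows (all table and column names here are invented):

```php
<?php
// Hypothetical schema: changelog(row_id, action, changed_at), populated by triggers.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

// Read this from wherever you record the last successful export.
$lastExport = '2008-01-01 00:00:00';

// Export only the rows touched since the last run instead of the whole dataset.
$stmt = $pdo->prepare(
    'SELECT c.row_id, c.action, i.*
       FROM changelog c
  LEFT JOIN items i ON i.id = c.row_id
      WHERE c.changed_at > :last_export'
);
$stmt->execute(array(':last_export' => $lastExport));

// ...write these rows out as the differential XML log to ship to dataset B...
```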

The problem with (b) is the whole "comparing the database" all at once. Fine for 100 rows. Bad for 10^9 rows. Nasty nasty.

In fact, it can all be nasty. Replication is nasty.

A better plan is to look into a "real" replication system designed for the particular databases that you're running (assuming you're running a database). Something that perhaps sends database log records over for synchronization rather than trying to roll your own.

Most modern DBMSs ship with replication systems.

Will Hartung
A: 

If you are using Apache, you might also consider Apache mod_gzip. This should allow you to compress the file automatically and the decompression should also happen automatically, as long as both sides accept gzip compression.

Darryl Hein