My client wanted a way to offer downloads to users, but only after they fill out a registration form (basically name and email). An email is sent to the user with the links for the downloadable content. The links contain a registration hash unique to the package, file, and user, and they actually go to a PHP page that logs each download and pushes the file out by writing it to stdout (along with the appropriate headers). This solution has inherent flaws, but it's how they wanted to do it. It needs to be said that I pushed them hard to 1) limit the sizes of the downloadable files and 2) think about using a CDN (they have international customers but are hosted in the US on 2 mirrored servers and a load balancer that uses sticky IPs).

Anyway, it "works for me", but some of their international customers are on really slow connections (download rates of ~60 kB/sec), and some of these files are pretty big (150 MB). Since a PHP script is serving these files, it is bound by the script timeout setting. At first I set this to 300 seconds (5 minutes), but that was not enough time for some of the beta users. So then I tried calculating the script timeout based on the file size divided by a 100 kB/sec connection, but some of these users are even slower than that.
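
To make the setup concrete, the serving script is essentially doing something like the sketch below (a simplified illustration only; the hash lookup and logging are just hinted at, and the file path is a placeholder):

<?php
// Minimal sketch of the kind of download script described above.
$hash = isset($_GET['hash']) ? $_GET['hash'] : '';

// ...validate $hash against the registration records and log the attempt here...

$file = '/var/files/private/package.zip';    // placeholder: path resolved from $hash
if ($hash === '' || !is_readable($file)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}

set_time_limit(300);                         // the limit that keeps biting us

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . basename($file) . '"');
header('Content-Length: ' . filesize($file));

// Stream in chunks so a 150 MB file never has to sit in memory.
$fp = fopen($file, 'rb');
while (!feof($fp)) {
    echo fread($fp, 8192);
    flush();
}
fclose($fp);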

Now the client wants to just up the timeout value. I don't want to remove the timeout altogether in case the script somehow gets into an infinite loop. I also don't want to keep pushing the timeout out arbitrarily to cover some catch-all, lowest-common-denominator connection rate (most people are downloading much faster than 100 kB/sec). And I also want to be able to tell the client at some point, "Look, these files are too big to process this way. You are affecting the performance of the rest of the website with these 40-plus-minute connections. We either need to rethink how they are delivered or use much smaller files."

I have a couple of solutions in mind, which are as follows:

  1. CDN - move the files to a CDN service such as Amazon's or Google's. We can still log the download attempts via the PHP file, but then redirect the browser to the real file. One drawback with this is that a user could bypass the script and download directly from the CDN once they have the URL (which could be gleaned by watching the HTTP headers). This isn't bad, but it's not desired.
  2. Expand the server farm - Expand the server farm from 2 to 4+ servers and remove the sticky IP rule from the load balancer. Downside: these are Windows servers so they are expensive. There is no reason why they couldn't be Linux boxes, but setting up all new boxes could take more time than the client would allow.
  3. Setup 2 new servers strictly for serving these downloads - Basically the same benefits and drawbacks as #2, except that we could at least isolate the rest of the website from (and fine tune the new servers to) this particular process. We could also pretty easily make these Linux boxes.
  4. Detect the user's connection rate - I had in mind a way to detect the user's current speed by using AJAX on the download landing page to time how long it takes to download a static file of known size, then sending that info to the server and calculating the timeout based on it. It's not ideal, but it's better than estimating the connection speed as too high or too low. I'm not sure how I would get the speed info back to the server, though, since we currently use a redirect header that is sent from the server. (A rough sketch of the server-side piece follows this list.)
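
For what it's worth, the measured rate from the AJAX test could be POSTed to a tiny endpoint that simply stores it in the session, and the download script could then compute its own timeout from that value. The sketch below assumes exactly that; the session key, bounds, and safety margin are all made up:

<?php
// Download script (sketch): derive the timeout from the client's measured rate.
session_start();

$file = '/var/files/private/package.zip';                      // placeholder path
$kbps = isset($_SESSION['measured_kbps'])
      ? (float) $_SESSION['measured_kbps']                     // set by the speed-test endpoint
      : 100.0;                                                 // fall back to 100 kB/sec
$kbps = max(25.0, min($kbps, 10000.0));                        // clamp spoofed/absurd values

$seconds = (int) ceil((filesize($file) / 1024) / $kbps) * 2;   // 2x safety margin
set_time_limit(min($seconds, 3600));                           // hard ceiling of 1 hour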

Chances are #1-#3 will be declined or at least pushed off. So is #4 a good way to go about this, or is there something else I haven't considered?

(Feel free to challenge the original solution.)

+1  A: 

The easy solution would be to disable the timeout. You can do this on a per-request basis with:

set_time_limit(0);

If your script is not buggy, this shouldn't be a problem – unless your server is not able to handle so many concurrent connections due to slow clients.

In that case, #1, #2 and #3 are all good solutions, and I would go with whichever is cheapest. Your concerns about #1 could be mitigated by generating download tokens that can only be used once, or only for a short period of time.

Option #4, in my opinion, is not a great option. The speed can vary greatly during a download, so any estimate you make at the start is quite likely to be wrong.

Artefacto
Setting the timeout to 0 is not desirable at all, "just in case". Do Amazon and Google have an API for setting/expiring these tokens? I'm familiar enough with the purpose of these services, but haven't looked into the implementation yet.
Chrisbloom7
@Chris I think you'll have to write a small program that runs in EC2/AppEngine that does that job.
Artefacto
set_time_limit(0) really is the best solution. "Just in case" isn't an objection based on a concrete reason, and it shouldn't be the reason to ignore the most obvious, simplest solution. If your download script is just loading bytes and sending them, there's no way it will get into an endless loop. If it stalls too long on the browser side, the client will disconnect, and the PHP script will be ended by the server. If you still don't want to do it, set it to something like 6 hours; that way you know for sure that the server will clean out any hanging PHP processes.
GrandmasterB
I totally disagree with you on this. Why risk having the server hang from some silly error causing an endless loop? I pride myself on having plenty of safety checks in my code, but I would never leave a door like that open. The only time I ever set it to zero is for scripts I'm running from the command line.
Chrisbloom7
@Chris It's not particularly easy to create an endless loop in PHP... But even if there were one, so what? It wouldn't be the end of the world.
Artefacto
Can't believe that's the general consensus on this point, but I guess it's just a pet peeve of mine then. Why take the risk? It's the weakest part of your code and it's easily avoidable. We have script timeouts for a reason. Anyway, that's not a viable solution to this problem, just a band-aid, and it won't help the performance issues we're already seeing.
Chrisbloom7
I agree with Chris on this one. I'm equally surprised that `set_time_limit(0)` would be a generally accepted option. Here's a list of things that can go wrong, just off the top of my head: 1) A user disconnects, and now your script may hang. 2) An attacker deliberately stalls the connection and depletes your server's resources. 3) The program gets into an endless loop <-- this can be extremely easy. Any non-trivial application uses a bunch of libraries; how do you know they don't get into endless loops? `set_time_limit(99999)` might have made some sense, but `set_time_limit(0)` is really a no-no to me.
kizzx2
@kizzx2 1) If the user disconnects abnormally, PHP will still be able to detect it the next time it sends data, and will then stop the script. 2) This is still possible even without PHP. A user throttling his connection to 0.5 kB/s can make a 2 MB image take over 1 hour to download. 3) If you're in an endless loop and sending data to the client, the client will eventually disconnect and kill the PHP script. If you're not sending data, well... it's just another bug; an easy one to detect and not particularly dangerous.
Artefacto
@Artefacto Those are valid points. I tried hard to come up with more possible error scenarios because `set_time_limit(0)` looks extremely ugly. Yet if PHP is smart enough most of the time, as you described, to detect all erroneous situations, then this default timeout thing seems unnecessary to me. Given that the hosting HTTP servers have their own timeouts, deducing from what you said, `set_time_limit(0)` should probably become one of those mythical "best practices", and this timeout should probably be removed from the language.
kizzx2
A: 

I think the main problem is serving the file through a PHP script. Not only will you have the timeout problem, but there is also a web server process tied up for the whole time the file is being sent to the client.

I would recommend some variation of #1. It doesn't have to be a CDN, but the PHP script should redirect directly to the file. You could prevent the bypass with a rewrite rule and a parameter that is checked against the current request time, so the link expires.
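
For illustration, that expiring-parameter check could look roughly like this (a sketch only; the secret, lifetime, and URL layout are assumptions):

<?php
// Generating the link after the registration check has passed:
$secret  = 'some-private-key';                       // placeholder secret
$file    = 'package.zip';                            // placeholder file name
$expires = time() + 600;                             // link valid for 10 minutes
$token   = hash_hmac('sha1', $file . '|' . $expires, $secret);
$url     = '/get.php?f=' . urlencode($file) . '&e=' . $expires . '&t=' . $token;

// Verifying it in the script the rewrite rule (or link) points at:
$f = basename(isset($_GET['f']) ? $_GET['f'] : '');
$e = isset($_GET['e']) ? (int) $_GET['e'] : 0;
$t = isset($_GET['t']) ? $_GET['t'] : '';
if ($e < time() || hash_hmac('sha1', $f . '|' . $e, $secret) !== $t) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
header('Location: /files/' . rawurlencode($f));      // hand the request back to the web server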

Kau-Boy
Thanks for the suggestion. I'm leery of putting the files in a public-facing folder, as many of them are EXE files. I know I should be able to lock them down with proper ACL/permissions, but the permissions on this server occasionally get lost (or at least poorly copied) with the way they mirror their content. (Lots of little problems with this client tend to snowball into bigger problems, and we can't fix them all at once.)
Chrisbloom7
Then use a Unix system, which does not execute .exe files, or pack them into archives.
Kau-Boy
The merits of that solution hadn't escaped me, but I addressed the problems with switching to a *nix server in my original question.
Chrisbloom7
A: 

I think you might do something like #1, except keep it on your servers and bypass serving it via PHP directly. After whatever auth/approval needs to happen in PHP, have that script create a temporary link to the file for download via traditional HTTP. On a *nix system I'd do this via a symlink to the real file, and have a cron job run every n minutes to clear out old links.
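
For illustration, the symlink step might look roughly like this (paths and the cleanup interval are assumptions, and it relies on *nix symlinks plus FollowSymLinks being enabled for the public directory):

<?php
// After the auth/approval and logging steps in PHP:
$realFile = '/var/files/private/package.zip';              // real file, outside the web root
$linkDir  = '/var/www/html/downloads/tmp';                 // publicly served directory
$token    = sha1(uniqid(mt_rand(), true));                 // hard-to-guess link name

$link = $linkDir . '/' . $token . '-' . basename($realFile);
symlink($realFile, $link);

// Let the web server stream it over plain HTTP:
header('Location: /downloads/tmp/' . rawurlencode(basename($link)));
exit;

// Cleanup via cron, e.g. every 10 minutes:
//   find /var/www/html/downloads/tmp -type l -mmin +30 -delete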

prodigitalson
+1  A: 

I am a bit reserved about #4. An attacker could forge a fake AJAX request to set your timeout to a very high value, which brings back the runaway-script scenario you were worried about in the first place.

I would suggest a solution similar to @prodigitalson's. You can create directories named with hash values, e.g. /downloads/389a002392ag02/myfile.zip, where the directory symlinks to the real file. Your PHP script redirects to that path, which then gets served by the HTTP server. The symlink gets deleted periodically.

The added benefit of creating a directory instead of a file is that the end user doesn't see a mangled file name.

kizzx2
Thanks for that suggestion. I do currently have a min/max limit on the timeout calculation, so that would take care of any spoofing in the AJAX scenario.
Chrisbloom7
I do like the idea of using temp folders/files. The only drawback that I can see is that it relies on the load balancer using those sticky IPs. If we disable that later (which I hope to do at some point - there is a technical reason for having them in place now that will take some work to get rid of), then it might not work, since the initial request could go to one server and the follow-up request to another. I suppose I could always use the IP address of the originating server in the redirect, though...
Chrisbloom7
A: 

You may create a temp file on the disk, or a symlink, and then redirect (using header()) to that temp file. A cron job could then come along and remove "expired" temp files. The key here is that every download should have a unique temp file associated with it.

Quamis
+1  A: 

Use X-Sendfile. Most web servers will support it, either natively or through a plugin (mod_xsendfile for Apache).

Using this header, you can simply specify a local file path and exit the PHP script. The web server sees the header and serves that file instead.
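
A minimal sketch of what that looks like in the PHP script (the path is a placeholder; with mod_xsendfile, a directory outside the document root also has to be allowed via XSendFilePath in the Apache config):

<?php
// After validating the registration hash and logging the download:
$file = '/var/files/private/package.zip';    // can live outside the web root

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . basename($file) . '"');
header('X-Sendfile: ' . $file);
exit;   // Apache picks up the header and streams the file itself, so the PHP
        // script finishes immediately and the timeout no longer depends on the
        // client's download speed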

Evert
I assume the file has to be inside the public-facing portion of the website, correct? Or does Apache figure it out and serve it somehow even if it's outside the web root? I'm purposely keeping these files outside of the web root for security's sake.
Chrisbloom7
So looking at this option, it sounds like it will do the job. I just need to make sure it works under Windows. I will try it out and report back. Thanks for the tip.
Chrisbloom7
Grrr, except I don't seem to be able to compile this under Snow Leopard properly. apxs compiles it properly, but Apache refuses to start. Says either "API module structure 'xsendfile_module' in file /Applications/MAMP/Library/modules/mod_xsendfile.so is garbled - expected signature 41503230 but saw 41503232 - perhaps this is not an Apache module DSO, or was compiled for a different Apache version?" or "Cannot load /Applications/MAMP/Library/modules/mod_xsendfile.so into server: cannot create object file image or add library" depending on what arch flags I set for apxs.
Chrisbloom7
You're better off using macports :).
Evert
Using macports for what? If you mean to run a MAMP setup, I wrestled with MacPorts for months to get it to work properly. The first time I had to completely wipe my MacBook and start over. The second time it totally fubared some dynamic library which led to the entire /var folder being corrupted. I very nearly lost the whole thing but was able to boot from the OSX DVD and repair the damage. After that, I tried MAMP Pro and fell in love with it. Couldn't be easier to get a nice MAMP environment running. I use MacPorts for managing some things like git and ruby, but I'm really happy with MAMP.
Chrisbloom7
Besides, MacPorts won't help me in this case if I can't compile the mod_xsendfile module anyway
Chrisbloom7
You can also use `virtual` if you're using an Apache module.
Artefacto
There's a good chance macports has mod_xsendfile, but don't take it out on me if it didn't work. Just trying to help here :)
Evert
Not taking it out on you, @Evert. Just saying MacPorts won't help me in this case (it's not there; I searched for it already). I have since found out that it actually compiles OK for the native Apache instance in Snow Leopard, but MAMP Pro's version of Apache doesn't load it. So I've taken it up with them.
Chrisbloom7
Despite not being able to get X-Sendfile set up in my local MAMP environment, the Windows binary version "just worked" on Windows, and it definitely does the job. I added a small test to make sure the mod_xsendfile module is loaded in Apache (just in case I can't get it working locally). If it is, I hand the file off to Apache to finish the download and don't have to bother upping the script timeout. If it's not available, I fall back to the old code that forces the download, except I dropped the assumed connection speed to 50 kB/sec in my calculation for the script timeout limit.
Chrisbloom7
Oh, and this solution has the added benefit of meeting all of the requirements of the spec: files outside the web root, and validating and logging each download attempt. Plus it required very few changes to the code. Surprised I'd never heard of this module before!
Chrisbloom7
Ya it rocks =) Glad I could help
Evert