I have a script that lets the user upload text files (PDF or doc) to the server, then the plan is to convert them to raw text. But until the file is converted, it's in its raw format, which makes me worried about viruses and all kinds of nasty things.

Any ideas on what I need to do to minimize the risk from these unknown files? How can I check that a file is clean, that it really is the format it claims to be, and that it will not crash the server?

+1  A: 

Hum - imho you should not really have to worry about the document type or something; if you use a good converter to convert to raw text then this one ought to do those checks without crashing the server.

Just like your client computer, a server should always be protected against viruses and attacks, so the newly uploaded file should be scanned before it is processed.

I've never seen a web app doing those kinds of checks itself - have you?

dhh
+1  A: 

IMHO, until something tries to execute it, it's just a file. However, you can definitely check (but do not rely upon, as clarified below) the file extension, and could also research the file formats to see if there are any characteristic sequences of bytes in the header of the file that you could verify.
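To make the "characteristic sequences of bytes" idea concrete, here is a minimal sketch. The function name `sniff_document_type()` and its return labels are made up for illustration. It relies on two well-known signatures: PDF files start with `%PDF-`, and legacy Word `.doc` files use the OLE2 compound-file header `D0 CF 11 E0 A1 B1 1A E1`. That OLE2 header is shared with old Excel and PowerPoint files, so a match only tells you the container, not that the file is really a Word document.

```php
<?php
// Sketch only: sniff_document_type() is a hypothetical helper, not a library function.
// It looks at the first 8 bytes of a file and reports which known signature they match.
function sniff_document_type(string $path): ?string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return null; // unreadable or missing file
    }
    $head = fread($handle, 8);
    fclose($handle);
    if ($head === false) {
        return null;
    }

    // PDF files begin with the ASCII marker "%PDF-".
    if (strncmp($head, '%PDF-', 5) === 0) {
        return 'pdf';
    }
    // OLE2 compound file header, used by legacy .doc (and also .xls, .ppt).
    if ($head === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }
    return null; // no known signature
}
```

A matching signature does not prove the file is harmless, only that it is not obviously something else, so treat this as a cheap first filter.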

Aerik
No, no, no! NEVER rely on the file extension to tell you anything. Use finfo_file() in PHP >= 5.3 or mime_content_type() in older versions.
Cfreak
Sorry, I wasn't clear - I would never *rely* on the file extension, I was just saying that you could certainly exclude anything that has the *wrong* extension. Of course, not putting any trust into the matching extension you do find, the check might just be considered useless overhead. Maybe useful if you just want to tell users, "Sorry, I only accept docs and pdfs".
Aerik
+1  A: 

If you're viewing the PDF, there is nothing you can do besides get antivirus and pray that it catches a maliciously formed PDF.

Conversion software normally isn't targeted though, so if you just convert it and view the text format output, you should be somewhat safer.


Oh, you are worried about the server. Just don't execute the uploaded files...

Longpoke
+3  A: 

I posted this as a comment to Aerik, but it's really the answer to the question.

If you have PHP >= 5.3 use finfo_file(). If you have an older version of PHP you can use mime_content_type() (less reliable) or load the Fileinfo extension from PECL.

Both of these functions return the MIME type of the file (by looking at the data inside it rather than at its name). For PDF it should be

application/pdf

For a word doc it could be a few things. Generally it should be

application/msword
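As a rough sketch of that check, assuming the Fileinfo extension is loaded: the allow-list and the helper names `detect_mime_type()` and `is_allowed_upload()` are my own, and the list uses `application/pdf` (the registered type for PDF) plus `application/msword`. Depending on the libmagic database on your server, a Word file may be reported under a different type, so adjust the list to what you actually see.

```php
<?php
// Hypothetical allow-list for this example; extend it to match your server's output.
const ALLOWED_MIME_TYPES = ['application/pdf', 'application/msword'];

// Returns the MIME type detected from the file contents, or null on failure.
function detect_mime_type(string $path): ?string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE); // requires the Fileinfo extension
    $mime = @$finfo->file($path);
    return $mime === false ? null : $mime;
}

// True only when the detected type is on the allow-list.
function is_allowed_upload(string $path): bool
{
    return in_array(detect_mime_type($path), ALLOWED_MIME_TYPES, true);
}
```

In an upload handler you would typically call `is_allowed_upload($_FILES['document']['tmp_name'])` before doing anything else with the file, where `document` stands in for whatever your form field is called.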

If your server is running *nix then make sure the files you're saving aren't executable. Even better: save them to a folder that isn't accessible by the web server. You can still write code to access the files but someone requesting a web page won't be able to access them at all.
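Something along these lines would cover both points, as a sketch rather than a finished implementation. `store_upload()` and its parameters are names I made up; the `$move` parameter exists only so the function can be exercised outside a real HTTP upload, and by default it uses `move_uploaded_file()`. It needs PHP 7.1+ for `random_bytes()` and the nullable type.

```php
<?php
// Sketch only: store_upload() is a hypothetical helper.
// Moves an upload into a private directory under a random name and removes execute bits.
function store_upload(string $tmpPath, string $storageDir, ?callable $move = null): string
{
    // move_uploaded_file() refuses files that did not arrive via an HTTP POST upload.
    $move = $move ?? 'move_uploaded_file';

    // Server-chosen random name: nothing supplied by the client ends up in the path.
    $target = rtrim($storageDir, '/') . '/' . bin2hex(random_bytes(16));

    if (!$move($tmpPath, $target)) {
        throw new RuntimeException('Could not store the uploaded file.');
    }

    chmod($target, 0600); // owner read/write only, no execute bit
    return $target;
}
```

A call might look like `store_upload($_FILES['document']['tmp_name'], '/var/uploads')`, where both the field name and the directory are placeholders; the directory should sit outside the web server's document root.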

Cfreak
+1  A: 

If you've ever opened or executed any user-uploaded file on the server, you should expect that your server is now compromised.

Even a JPG can contain executable PHP. If you include or require the file in any way in your script, that can also compromise your server. Take an image you stumble upon on the web that is served like so...


header('Content-type: image/jpeg');
header('Content-Disposition: inline; filename="test.jpg"');

echo file_get_contents('/some_image.jpg');
echo '<?php phpinfo(); ?>';

... which you save and re-host on your own server like so...


$q = $_GET['q']; // pretend this is sanitized for the moment
header('Content-type: '.mime_content_type($q));
header('Content-Disposition: inline; filename="'.$_GET['q'].'"');

include $q;

...will execute phpinfo() on your server. Your site users can then simply save the image to their desktop and open it with notepad to see your server settings. Simply converting the file to another format will discard that script, and should not trigger any actual virus attached to the file.
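For contrast, here is a sketch of a re-hosting script that never hands the file to the PHP parser. The helper names and the `/var/uploads` directory are placeholders of mine. The two points it illustrates are that `basename()` drops any directory part of the requested name, and that `readfile()` only copies bytes to the output, so an embedded `<?php ... ?>` block is sent as data and not run.

```php
<?php
// Hypothetical storage directory, ideally outside the web root.
const UPLOAD_DIR = '/var/uploads';

// Maps a requested name to a file inside $baseDir, or null if there is no such file.
function resolve_upload(string $requested, string $baseDir): ?string
{
    // basename() strips directory components, so "../../etc/passwd" becomes "passwd".
    $path = rtrim($baseDir, '/') . '/' . basename($requested);
    return is_file($path) ? $path : null;
}

// Sends the file as an opaque download; nothing in it is ever interpreted as PHP.
function send_upload(string $path): void
{
    // Keep only harmless characters in the name echoed back to the client.
    $safeName = preg_replace('/[^A-Za-z0-9._-]/', '_', basename($path));

    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . $safeName . '"');
    header('X-Content-Type-Options: nosniff');
    readfile($path); // streams raw bytes, unlike include/require
}
```

The calling script would resolve `$_GET['q']` with `resolve_upload($_GET['q'], UPLOAD_DIR)`, answer with a 404 when that returns null, and otherwise pass the result to `send_upload()`.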

It might also be best to run a virus scan on upload. You should be able to call a scanner through a system command and check its result to see whether it found anything. Your site users should be checking files they download anyway.
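As an illustration of the system-command approach, this sketch assumes ClamAV's `clamscan` is installed and on the PATH, which is my assumption and not something the question states. `scan_upload()` is a made-up name. It relies on the exit codes documented for `clamscan`: 0 for no virus found, 1 for a virus found, and anything else for an error.

```php
<?php
// Sketch only: scan_upload() is a hypothetical wrapper around an external scanner.
// Returns 'clean', 'infected', or 'error' based on the scanner's exit code.
function scan_upload(string $path, string $scanner = 'clamscan'): string
{
    // escapeshellarg() keeps an odd filename from being interpreted by the shell.
    $command = escapeshellcmd($scanner) . ' --no-summary ' . escapeshellarg($path);
    exec($command, $outputLines, $exitCode);

    if ($exitCode === 0) {
        return 'clean';    // scanner ran and found nothing
    }
    if ($exitCode === 1) {
        return 'infected'; // scanner reported a match
    }
    return 'error';        // scanner missing, unreadable file, or other failure
}
```

Treating 'error' the same as 'infected' (reject the upload) is the cautious choice, since a scanner that failed to run tells you nothing about the file.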

Otherwise, even a virus-laden, user-uploaded file just sitting there on your server shouldn't harm anything... as far as I know.

bob-the-destroyer