Assume you have an HTML form with an input tag of type 'file'. When the file is posted to the server, it will be stored locally, along with relevant metadata.

I can think of three ways to determine the MIME type:

  • Use the MIME type supplied in the 'multipart/form-data' payload.
  • Use the file name supplied in the 'multipart/form-data' payload and look up the MIME type based on the file extension.
  • Scan the raw file data and use a MIME type guessing library.

None of these solutions is perfect.

Which is the most accurate solution?
Is there another, better option?

+1  A: 

If you are using PHP, you can use

http://pecl.php.net/package/Fileinfo

which inspects many aspects of the file. For Python you can use

http://pypi.python.org/pypi/python-magic/0.1

which provides bindings for libmagic on Linux/Unix (and possibly Windows) systems. On Linux, see:

man magic
man libmagic

It uses magic number tests to try to determine the MIME type of a file.

I like the magic number method because it can catch wrong extensions and a lot of trickery when you are handling uploaded files on a web server. These tests are generally one-offs, so the performance hit of reading through the file is negligible.
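As a rough illustration of how a magic number test can catch a wrong extension, the sketch below compares the first bytes of an upload against a tiny hand-written signature table and checks whether that agrees with the file extension. The table and the names `SIGNATURES`, `sniff_mime` and `extension_is_trustworthy` are invented for this example; libmagic's database is far larger, so treat this as a toy, not a replacement.

```python
import mimetypes

# Illustrative signature table: (leading bytes, MIME type).
# Only a handful of well-known formats; libmagic knows far more.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_mime(head):
    """Return the MIME type whose signature starts `head`, or None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def extension_is_trustworthy(filename, head):
    """True only if the extension's MIME type agrees with the sniffed one."""
    claimed, _encoding = mimetypes.guess_type(filename)
    sniffed = sniff_mime(head)
    return sniffed is not None and claimed == sniffed


# A PNG header, as it would appear at the start of an uploaded file.
png_head = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
print(sniff_mime(png_head))
print(extension_is_trustworthy("photo.png", png_head))
print(extension_is_trustworthy("invoice.pdf", png_head))
```

Only the first few bytes are needed, so in practice you would read a short prefix of the stored file rather than the whole thing.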

Aiden Bell
+1  A: 

I don't think you can rely on any one of these as the definitive "I am MIME type x". The problem with the first two is that the information supplied by the client may be incorrect, either because of issues with the client (browser or otherwise) or because of a deliberately misleading request (various hack attempts, etc.).

So you should probably combine information from each method and work out some sort of confidence level. If the file extension is .doc and the MIME type is application/msword, then there's a pretty good chance it's a Word document, but run it through a MIME type detection utility just to make sure.
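One possible way to turn that into code is a simple vote between the three signals. This is only a sketch: the function name `mime_confidence`, the "high/medium/low" labels, and the choice to prefer the sniffed type over the other two are all assumptions made for the example, and the sniffed type is passed in as a parameter so that any detection library can supply it.

```python
import mimetypes


def mime_confidence(declared, filename, sniffed):
    """Combine three signals into (best_guess, confidence).

    declared: Content-Type sent by the client (may be None)
    filename: original file name sent by the client
    sniffed:  type reported by a content-based detector (may be None)
    """
    by_extension, _encoding = mimetypes.guess_type(filename)
    candidates = [declared, by_extension, sniffed]
    # Assumption: content sniffing is the hardest signal for a client
    # to get wrong, so it wins when the signals disagree.
    best = sniffed or by_extension or declared
    if best is None:
        return None, "none"
    # Count how many of the three signals agree with the chosen type.
    votes = sum(1 for c in candidates if c == best)
    confidence = {3: "high", 2: "medium", 1: "low"}[votes]
    return best, confidence


print(mime_confidence("application/msword", "report.doc", "application/msword"))
print(mime_confidence("application/msword", "report.doc", None))
print(mime_confidence("image/png", "photo.png", "application/pdf"))
```

What to do with a "low" result (reject the upload, store it as application/octet-stream, flag it for review) depends on the application.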

There should be a solution for MIME magic detection available in the language you're using, although you didn't mention which one. They generally work by looking at the first few bytes/characters of the file and matching them against a lookup table of MIME types. Some also strip the BOM (byte order mark) from the file to help with this. Often they fall back to plain text if the MIME type can't be detected.
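The general shape of such a detector (strip a BOM, match a prefix against a lookup table, fall back to plain text) might look like the following in Python. The three-entry `LOOKUP` table and the function names are hypothetical and chosen only to keep the example short; real libraries ship far more rules.

```python
import codecs

# Byte order marks to skip before matching. The UTF-32 marks come first
# because the UTF-32 LE mark begins with the same bytes as the UTF-16 LE one.
BOMS = [
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
]

# Illustrative lookup table: (leading bytes, MIME type).
LOOKUP = [
    (b"%PDF-", "application/pdf"),
    (b"<?xml", "application/xml"),
    (b"PK\x03\x04", "application/zip"),
]


def strip_bom(data):
    """Remove a leading byte order mark, if there is one."""
    for bom in BOMS:
        if data.startswith(bom):
            return data[len(bom):]
    return data


def detect(data, fallback="text/plain"):
    """Match the start of `data` against LOOKUP, else return `fallback`."""
    head = strip_bom(data[:64])
    for prefix, mime in LOOKUP:
        if head.startswith(prefix):
            return mime
    return fallback


print(detect(codecs.BOM_UTF8 + b"<?xml version='1.0'?><root/>"))
print(detect(b"%PDF-1.4 ..."))
print(detect(b"just some notes"))
```

Falling back to text/plain mirrors the behaviour described above; for untrusted uploads a more cautious default such as application/octet-stream can be passed as `fallback`.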

If you want a platform-independent approach to this, take a look at the various Java libraries that exist:

Jon