My upload form expects a tar file and I want to check whether the uploaded data is valid. The tarfile module supports is_tarfile(), but expects a filename - I don't want to waste resources writing the file to disk just to check if it is valid.

So, is there a way to check the data is a valid tar file without writing to disk, using standard Python libraries?

+2  A: 

The tar file format is documented on Wikipedia.

I suspect your best bet would be to check that the header checksum for the first file is valid. You may also want to check the file name for sanity but that may not be reliable, depending on the file names that have been stored in there.

Duplicating the relevant information here:

Offset  Size  Description
     0   100  File name
   100     8  File mode
   108     8  Owner's numeric user ID
   116     8  Group's numeric user ID
   124    12  File size in bytes
   136    12  Last modification time in numeric Unix time format
   148     8  Checksum for header block
   156     1  Link indicator (file type)
   157   100  Name of linked file
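
For illustration only, here is a minimal sketch of slicing a 512-byte header block at the offsets in the table above. The function name parse_header_fields is made up for this example, and the sketch assumes plain ASCII octal numeric fields (the base-256 encoding GNU tar uses for very large values is not handled).

```python
import io
import tarfile

def parse_header_fields(block):
    """Slice a 512-byte tar header block into the fields tabulated above."""
    def text(raw):
        # Text fields are NUL-padded; keep only the part before the first NUL.
        return raw.split(b"\0", 1)[0].decode("ascii", "replace")

    def octal(raw):
        # Numeric fields are ASCII octal, padded with NULs and/or spaces.
        digits = raw.strip(b"\0 ")
        return int(digits, 8) if digits else 0

    return {
        "name": text(block[0:100]),
        "mode": octal(block[100:108]),
        "uid": octal(block[108:116]),
        "gid": octal(block[116:124]),
        "size": octal(block[124:136]),
        "mtime": octal(block[136:148]),
        "chksum": octal(block[148:156]),
        "typeflag": block[156:157],
        "linkname": text(block[157:257]),
    }

# Usage: build a one-member archive in memory and inspect its first header.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    tf.addfile(info, io.BytesIO(b"hello"))
fields = parse_header_fields(buf.getvalue()[:512])
print(fields["name"], fields["size"])  # hello.txt 5
```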

The checksum is calculated by taking the sum of the unsigned byte values of the header block with the eight checksum bytes taken to be ASCII spaces (decimal value 32).

It is stored as a six-digit octal number with leading zeroes, followed by a null and then a space.

Various implementations do not adhere to this, so reading the checksum from the first six digits after trimming whitespace gives better compatibility. In addition, some historic tar implementations treated bytes as signed.

Readers must calculate the checksum both ways, and treat it as good if either the signed or unsigned sum matches the included checksum.
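
The checksum rule described above could be sketched roughly as follows. The name header_checksum_ok is hypothetical, and the parsing of the stored value (strip NULs and spaces, then read what is left as octal) is one lenient interpretation, not the only possible one.

```python
import io
import tarfile

BLOCK_SIZE = 512

def header_checksum_ok(block):
    """Return True if a 512-byte header block carries a matching checksum."""
    if len(block) < BLOCK_SIZE:
        return False
    block = block[:BLOCK_SIZE]
    stored_field = block[148:156].strip(b"\0 ")
    try:
        stored = int(stored_field, 8)
    except ValueError:
        # Empty or non-octal checksum field: not a usable header.
        return False
    # Sum all bytes with the checksum field taken to be eight ASCII spaces.
    blanked = block[:148] + b" " * 8 + block[156:]
    unsigned_sum = sum(blanked)
    # Historic implementations summed signed chars, so bytes >= 128 count as negative.
    signed_sum = sum(b - 256 if b > 127 else b for b in blanked)
    return stored in (unsigned_sum, signed_sum)

# Usage: check the first header of an archive built in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    tf.addfile(info, io.BytesIO(b"hello"))
print(header_checksum_ok(buf.getvalue()[:BLOCK_SIZE]))  # True
print(header_checksum_ok(b"x" * BLOCK_SIZE))            # False
```

This only looks at the first 512 bytes of the upload, so it is a cheap first filter and says nothing about the rest of the archive.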

There is also the UStar format (also detailed in that link) but, since it's an extension to the old tar format, the method detailed above should still work. UStar is generally for just storing extra information about each file.

Alternatively, since Python is open source, you could see how is_tarfile works and adapt it to check your stream rather than a file. The source code is available here under Python-3.1.1/Lib/tarfile.py but it's not for the faint of heart :-)

paxdiablo
Is there a convention for encoding non-ASCII file names? That article mentions the problem, but doesn't mention a solution.
John Machin
+2  A: 

The TarFile class accepts a file-like object through its fileobj argument. I guess you can pass whatever file-like upload object you get from your web framework.

__init__(self, name=None, mode='r', fileobj=None)

Adding to paxdiablo's post: tar is a very difficult and complex file format, despite its apparent simplicity. You can check basic constraints, but if you have to support all the existing tar dialects you are going to waste a lot of time. Most of its complexity comes from the following issues:

  • absence of a real standard until a de-facto standard existed (UStar/pax)
  • holes in the specification leaving vendors grey areas where each one implemented their own solution
  • vendors saying "our tar is better, and it will take over t3h world"
  • limitations, and workarounds for these limitations (e.g. filename length)

Also, the format has no upfront header, so the only way to check whether the whole archive is sane is to scan the file completely, catch each record, and validate each one.
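
A full scan along those lines might look like the sketch below, which leans on the standard tarfile module and an in-memory io.BytesIO instead of hand-rolled parsing. The function name scan_tar_bytes is made up for this example. It assumes an uncompressed archive (mode "r:"), and it will not catch every kind of damage: as far as I can tell, tarfile treats an unreadable header after the first member as the end of the archive rather than as an error.

```python
import io
import tarfile

def scan_tar_bytes(data):
    """Walk every member of an in-memory archive; return (ok, member_names)."""
    names = []
    try:
        # mode "r:" means plain tar only, with no transparent decompression.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            for member in tf:
                names.append(member.name)
                if member.isreg():
                    # Reading the payload surfaces truncated member data.
                    extracted = tf.extractfile(member)
                    while extracted.read(64 * 1024):
                        pass
    except (tarfile.TarError, EOFError, OSError):
        return False, names
    return True, names

# Usage: a valid one-member archive versus arbitrary bytes.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as out:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    out.addfile(info, io.BytesIO(b"hello"))
print(scan_tar_bytes(buf.getvalue()))                # (True, ['hello.txt'])
print(scan_tar_bytes(b"definitely not a tar file"))  # (False, [])
```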

Stefano Borini
ah, you've beaten me by half a minute :-)
Eli Bendersky
not really, your method is another (probably better) way to achieve the same.
Stefano Borini
+2  A: 

The open method of tarfile takes a file-like object in its fileobj argument. This can be a StringIO instance.

Eli Bendersky
+2  A: 

Say your uploaded data is held in a bytes object named data.

import io
from tarfile import TarFile, TarError

bio = io.BytesIO(data)  # wrap the upload in an in-memory file object
try:
    tf = TarFile(fileobj=bio)  # reads the first header; nothing touches disk
    # process the file....
except TarError:
    print("Not a tar file")

There are additional complexities such as handling different tar file formats and compression. More info is available in the tarfile documentation.
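
As one way to deal with the compression part, tarfile.open can be asked to detect compression itself via mode "r:*". The sketch below assumes the whole upload fits in memory, and open_uploaded_tar is a hypothetical helper name, not something from the tarfile module.

```python
import gzip
import io
import tarfile

def open_uploaded_tar(data):
    """Open in-memory bytes as a tar archive, compressed or not; None if invalid."""
    try:
        # "r:*" lets tarfile try plain tar plus the compression schemes it supports.
        return tarfile.open(fileobj=io.BytesIO(data), mode="r:*")
    except tarfile.TarError:
        return None

# Usage: the same archive, plain and gzip-compressed, plus some junk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as out:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    out.addfile(info, io.BytesIO(b"hello"))
plain = buf.getvalue()

for payload in (plain, gzip.compress(plain), b"garbage" * 100):
    tf = open_uploaded_tar(payload)
    if tf is None:
        print("Not a tar file")
    else:
        with tf:
            print(tf.getnames())
```

The caller owns the returned TarFile and should close it, for example with a with block as in the usage loop.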

mhawke