I've written a small cryptographic module in Python whose task is to encrypt a file and put the result in a tarfile. The original file to encrypt can be quite large, but that's not a problem because my program only needs to work with a small block of data at a time, which can be encrypted on the fly and stored.

I'm looking for a way to avoid doing it in two passes, i.e. first writing all the data to a temporary file and then inserting the result into the tarfile.

Basically I do the following (where generator_encryptor is a simple generator that yields chunks of data read from sourcefile):

import tarfile

t = tarfile.open("target.tar", "w")
tmp = file('content', 'wb')
for chunk in generator_encryptor("sourcefile"):
    tmp.write(chunk)
tmp.close()
t.add('content')
t.close()

I'm a bit annoyed at having to use a temporary file, as I feel it should be easy to write blocks directly into the tar file. But collecting every chunk in a single string and using something like t.addfile('content', StringIO(bigcipheredstring)) seems excluded, because I can't guarantee that I have enough memory to hold bigcipheredstring.

Any hint on how to do that?

A: 

I guess you need to understand how the tar format works, and handle the tar writing yourself. Maybe this can be helpful?

http://mail.python.org/pipermail/python-list/2001-August/100796.html
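
If you do go that route, the standard library can still take care of the header bookkeeping: tarfile.TarInfo can serialize itself with tobuf(). A minimal sketch, assuming the encrypted size is known up front and reusing the generator_encryptor from the question:

import tarfile

BLOCKSIZE = 512  # tar archives are made of 512-byte blocks

out = open('target.tar', 'wb')
ti = tarfile.TarInfo('content')
ti.size = total_size              # placeholder: the encrypted size, known in advance

# Header first, then the member data streamed chunk by chunk.
out.write(ti.tobuf(tarfile.GNU_FORMAT))
written = 0
for chunk in generator_encryptor('sourcefile'):
    out.write(chunk)
    written += len(chunk)

# Pad the member data to a full block and write the end-of-archive marker
# (two empty blocks; tarfile itself additionally pads to a 10240-byte record).
if written % BLOCKSIZE:
    out.write('\0' * (BLOCKSIZE - written % BLOCKSIZE))
out.write('\0' * (2 * BLOCKSIZE))
out.close()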

static_rtti
+3  A: 

You can create your own file-like object and pass it to TarFile.addfile. Your file-like object will generate the encrypted contents on the fly in its read() method.
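
For illustration, a rough skeleton of such a wrapper; the class name and buffering policy are only an assumption, and generator_encryptor plus the encrypted size are taken from the question:

import tarfile

class EncryptedStream(object):
    """Minimal file-like wrapper: read() pulls data from a generator."""
    def __init__(self, generator):
        self.generator = generator
        self.buf = ''

    def read(self, size):
        # tarfile passes an explicit size; accumulate generator output
        # until that many bytes are available (or the data runs out).
        while len(self.buf) < size:
            try:
                self.buf += self.generator.next()
            except StopIteration:
                break
        data, self.buf = self.buf[:size], self.buf[size:]
        return data

    def close(self):
        pass

t = tarfile.open('target.tar', 'w')
ti = tarfile.TarInfo('content')
ti.size = encrypted_size          # placeholder: must be known in advance
t.addfile(ti, EncryptedStream(generator_encryptor('sourcefile')))
t.close()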

kaizer.se
Have a look at tarfile.py in your library. If I read it right, all the file object has to implement is .read() and .close(), and it will work.
kaizer.se
Seems easy enough. I will try that and post a listing back if it works.
kriss
The only things I see that you have to solve are that you have to pass the full encrypted file size before you start, and return chunks of the right size, but I suppose you can influence that. It is also valid to return less than the requested size in .read().
kaizer.se
If concurrency were easier to express in programming languages, you could just create a pipe (os.pipe()), pass the read end to addfile, and write to the write end. However, I think it is a complexity fail, since you have to set up different threads or processes to read and write.
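
For what it's worth, a hedged sketch of that pipe-plus-thread variant (it still needs the final size up front, because addfile reads exactly tarinfo.size bytes; encrypted_size and generator_encryptor are assumed from the question):

import os
import tarfile
import threading

def feed(write_fd, chunks):
    # Writer side: push the encrypted chunks into the pipe, then close it.
    w = os.fdopen(write_fd, 'wb')
    for chunk in chunks:
        w.write(chunk)
    w.close()

r_fd, w_fd = os.pipe()
writer = threading.Thread(target=feed,
                          args=(w_fd, generator_encryptor('sourcefile')))
writer.start()

reader = os.fdopen(r_fd, 'rb')
t = tarfile.open('target.tar', 'w')
ti = tarfile.TarInfo('content')
ti.size = encrypted_size          # placeholder: still has to be known beforehand
t.addfile(ti, reader)
t.close()
reader.close()
writer.join()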
kaizer.se
+2  A: 

Huh? Can't you just use the subprocess module to run a pipe through to tar? That way, no temporary file should be needed. Of course, this won't work if you can't generate your data in small enough chunks to fit in RAM, but if you have that problem, then tar isn't the issue.

unwind
The whole point is avoiding subprocesses. I want full Python exception management. I don't want to have to parse stderr to know why tar failed (out of disk space, can't spawn a new process, and so on).
kriss
A: 

Basically, using a file-like object and passing it to TarFile.addfile does the trick, but there are still some open issues:

  • I need to know the full encrypted file size at the beginning
  • the way tarfile calls the read method is such that the custom file-like object must always return full read buffers, or tarfile assumes it has reached the end of file. This leads to some really inefficient buffer copying in the code of the read method, but it's either that or changing the tarfile module.

The resulting code is below. Basically, I had to write a wrapper class that transforms my existing generator into a file-like object. I also added the GeneratorEncryptor class to my example to make the code complete. You can notice it has a __len__ method that returns the length of the written file (but understand it's just a dummy placeholder that does nothing useful).

import tarfile

class GeneratorEncryptor(object):
    """Dummy class for testing purpose

       The real one perform on the fly encryption of source file
    """
    def __init__(self, source):
        self.source = source
        self.BLOCKSIZE = 1024
        self.NBBLOCKS = 1000

    def __call__(self):
        for c in range(0, self.NBBLOCKS):
            yield self.BLOCKSIZE * str(c%10)

    def __len__(self):
        return self.BLOCKSIZE * self.NBBLOCKS

class GeneratorToFile(object):
    """Transform a data generator into a conventional file handle
    """
    def __init__(self, generator):
        self.buf = ''
        self.generator = generator()

    def read(self, size):
        # tarfile expects full 'size'-byte buffers until the data runs out,
        # hence the (inefficient) accumulation below.
        chunk = self.buf
        while len(chunk) < size:
            try:
                chunk = chunk + self.generator.next()
            except StopIteration:
                self.buf = ''
                return chunk
        self.buf = chunk[size:]
        return chunk[:size]

t = tarfile.open("target.tar", "w")
# An empty 'content' file is created so that gettarinfo() can stat it;
# the size field is then overridden with the real encrypted length.
tmp = file('content', 'wb')
tmp.close()
generator = GeneratorEncryptor("source")
ti = t.gettarinfo(name = "content")
ti.size = len(generator)
t.addfile(ti, fileobj = GeneratorToFile(generator))
t.close()
kriss
After looking at the tarfile.py source code, it seems easy enough to change the behavior where it expects read to always give back full buffers. I will probably file it as a bug and propose a corrective patch.
kriss
The limitation of having to know the size before writing can probably also be removed if the underlying tarfile is opened as a real file where you can seek around (i.e. not a stream always going forward). It only implies writing the tarinfo header twice, as the tarinfo is written before the content. It also necessitates some changes in the tarfile module (or some derived class).
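
A hedged sketch of that idea, going below tarfile and writing the archive by hand so the header can be rewritten once the size is known (GNU format keeps the header a single 512-byte block, so it can be overwritten in place; generator_encryptor is the one from the question):

import tarfile

BLOCKSIZE = 512

out = open('target.tar', 'wb')
ti = tarfile.TarInfo('content')

# First pass on the header: the size is not known yet, write a placeholder.
header_offset = out.tell()
out.write(ti.tobuf(tarfile.GNU_FORMAT))

# Stream the encrypted chunks straight into the archive.
size = 0
for chunk in generator_encryptor('sourcefile'):
    out.write(chunk)
    size += len(chunk)

# Pad the member data to a full block, then finish the archive
# with the two empty end-of-archive blocks.
if size % BLOCKSIZE:
    out.write('\0' * (BLOCKSIZE - size % BLOCKSIZE))
out.write('\0' * (2 * BLOCKSIZE))

# Seek back and rewrite the header, now with the real size.
ti.size = size
out.seek(header_offset)
out.write(ti.tobuf(tarfile.GNU_FORMAT))
out.close()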
kriss