I need to generate a tar file, but as a string in memory rather than as an actual file. What I have as input is a single filename and a string containing the associated contents. I'm looking for a Python library I can use to avoid having to roll my own.


A little more work turned up these functions, but using a memory stream object seems a little... inelegant. And making it accept input from strings looks even more... inelegant. OTOH, it works. I assume so, anyway, as most of it is new to me. Does anyone see any bugs in it?

+1  A: 

The standard tarfile module provides for the creation of .tar files.

Added in response to comment

The standard StringIO module allows the creation of file-like objects that can be written to as if they were files but are backed by strings.

msw
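As a minimal sketch of the file-like idea described in the answer above: on Python 3, the binary counterpart of `StringIO` is `io.BytesIO` (the thread itself predates that, and the variable names here are only illustrative).

```python
import io

# An in-memory buffer that behaves like a file opened in binary mode.
buf = io.BytesIO()
buf.write(b"hello ")
buf.write(b"world")

# getvalue() returns everything written so far, without touching the disk.
contents = buf.getvalue()
```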
Doesn't answer the question. That seems to work exclusively with real files. I need to work without touching the file system.
BCS
-(-1): OK, but see my edits and link. In short: seems kinda inelegant.
BCS
+5  A: 

Use tarfile in conjunction with cStringIO:

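import cStringIO
import tarfile

c = cStringIO.StringIO()
t = tarfile.open(mode='w', fileobj=c)
# here: do your work on t, then...:
t.close()          # finish the archive so the end-of-archive blocks are written
s = c.getvalue()   # extract the bytestring you need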
Alex Martelli
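For the exact case in the question (one filename plus a string of contents), the "do your work on t" step might look like the sketch below. It assumes Python 3, where `io.BytesIO` takes the place of `cStringIO`; the helper name `tar_bytes_from_string` and the UTF-8 encoding are assumptions for illustration, not part of the answer.

```python
import io
import tarfile


def tar_bytes_from_string(filename, contents):
    """Return a tar archive, as bytes, holding one member (hypothetical helper)."""
    data = contents.encode("utf-8")
    info = tarfile.TarInfo(name=filename)
    info.size = len(data)  # the size must be set before addfile() is called

    buf = io.BytesIO()
    with tarfile.open(mode="w", fileobj=buf) as tar:
        tar.addfile(info, io.BytesIO(data))
    # Leaving the with-block closes the archive (writing the end-of-archive
    # blocks) but leaves buf open, since we supplied it ourselves.
    return buf.getvalue()


archive = tar_bytes_from_string("hello.txt", "hello world\n")

# Read it back, again purely in memory, to check the round trip.
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    member = tar.getmembers()[0]
    round_trip = tar.extractfile(member).read().decode("utf-8")
```

Nothing here touches the file system; both the archive and the member contents live in `BytesIO` buffers.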
Seems like a bit of a rock+nail solution. It works, but it seems there should be a better way.
BCS
Alex Martelli
I know of the power of modularity, but from what little I know of tar, you should be able to do what I want in about 3 lines of code with nothing fancier than a sum function. And in this case, the difference between the modular solution and the direct one is kinda substantial.
BCS
@BCS, this answer _is_ 3 lines of code.
carl
@BCS, what's `sum` gotta do with it?! Anyway, as @carl points out, these _are_ three lines of code (plus one with just a comment;-) so your observation about the "difference [being] kinda substantial" **totally** escapes me. What "direct one"? Forcing **every** tool that reads or writes files to have `fromstring=contents` and `tostring=True` in their `open`?! And how would the latter return the final results after all operations? And, if strings, why not URLs? Oh, and sockets, too? And what else...? This way madness lies... to save **what**, maybe **ONE** (expletive deleted) **line**?!
Alex Martelli
@carl: how big is tarfile? I'm not talking about the code I write but the extra code I use.
BCS
@Alex M. The solution I'm thinking of would amount to `SomeFormatString % (filename, len(data), const + sum(filename) + sum("%d"%len(data)), data)` A slightly more readable/correct version of that would be ~3LOC and use nothing outside the core language.
BCS
@BCS, that code does **not** "generate a tar file in memory" -- at most, it formats the contents of **one** "data with filename" in an arbitrary file (and hard-codes details about the `tar` standard smack in the middle of application logic, very worst place for them -- especially given the repetitiousness wrt the fact that said knowledge's **already** in Python, and who could possibly care about whether it's "in the core language"?! The standard library is **just as core** as any other part. Plus, the `sum` calls will raise exceptions as they're called with string arguments. Bleagh!-)
Alex Martelli
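The point about `sum` raising on string arguments can be checked with a small sketch. It assumes Python 3, where iterating over a `bytes` object yields integers (on Python 2 one would need something like `sum(map(ord, s))` instead).

```python
# sum() starts from the integer 0, so adding a str to it fails.
try:
    sum("abc")
    raised = False
except TypeError:
    raised = True

# Summing byte values works, because iterating over bytes yields ints.
byte_total = sum(b"abc")  # 97 + 98 + 99
```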
BCS
Alex Martelli
What I'm asserting is that the form I'm wishing for (forlornly at this point) is the one that should have been used to build the other because it is the simpler primitive. Building a tar file on disk that contains multiple files is **trivial** given code to generate, in memory, the header for a single file (multi-file tar files can be built by appending single-file tar files if you cut off the right amount of padding). And from that, what I'm looking for is literally a single line of code: `header + content + somethingToZeroPad()`
BCS
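The `header + content + somethingToZeroPad()` shape described above can be sketched with pieces that already exist in `tarfile`. This assumes Python 3 and uses `TarInfo.tobuf` to produce the header block, so the format details stay inside the standard library; `member_bytes` and `BLOCK` are hypothetical names, and neither poster wrote this code.

```python
import io
import tarfile

BLOCK = 512  # tar works in 512-byte blocks


def member_bytes(filename, data):
    """Header + content + zero padding for one member (hypothetical helper)."""
    info = tarfile.TarInfo(name=filename)
    info.size = len(data)
    header = info.tobuf(format=tarfile.USTAR_FORMAT)  # one 512-byte header block
    padding = b"\0" * (-len(data) % BLOCK)
    return header + data + padding


# Members are simply concatenated; two zero blocks mark the end of the archive.
archive = (
    member_bytes("a.txt", b"first\n")
    + member_bytes("b.txt", b"second\n")
    + b"\0" * (2 * BLOCK)
)

# Check the result by reading it back with tarfile itself.
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    names = tar.getnames()
    second = tar.extractfile("b.txt").read()
```

As noted elsewhere in the thread, this approach holds every member's full contents in memory, which is the trade-off against the file-object interface.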
@BCS, OK, but _the_ right module for such a hypothetical function is obviously and exclusively `tarfile` (would make no sense as a built-in, in particular, when even features **much** more often used, like REs, aren't!-) and therefore your remark about "stuff that doesn't need an import" is **way** off-base. As a proposed patch to `tarfile` (for 3.2 or later, earlier versions being feature-frozen), feel free to propose it on the python-dev mailing list, of course (actually, propose it any way you like, but anywhere **but** tarfile will be shot down with 100% probability;-).
Alex Martelli
My comment on import was just about what is/isn't "core language" and the point about "core language" was just pointing out the difference in the amount of code involved between what I was looking for and what it looks like I'm going to get. I have very little issue with importing code that isn't used.
BCS
@BCS, if you're fine with doing `import` anyway (despite your previous comment saying essentially the reverse), then importing the 6-lines-or-so module implied by my response is not really different from importing a module of 4 or 5 lines needed, at the least, to use hypothetical new primitives offered in `tarfile`. The latter couldn't use those primitives you want, btw, without reading entirely in memory each and every entire file being archived, which may be impractical to impossible for sufficiently-large files, so the total amount of code could only grow.
Alex Martelli
This is getting a bit long in the tooth. All I'm trying to say at this point is that I really wish that the choice of primitive was different. It seems to me that there would be less overhead in a system that assumes things are strings than one that assumes they are files. Heck, if you can make a string look like a file without writing it out to disk, why not make a file look like a string without actually reading it all in first?
BCS
@BCS, yep, this thread is far too long, but IMHO the reason is that you seem to keep making wrong assertions in your argument, I shoot them down, and as you can't defend them you "sweep them aside", add more wrongness, and stubbornly keep insisting you're right (while you're "wrong as a doornail";-). E.g., yr latest: "make a file look like a string w/o reading it all in memory" would be a ridiculously slow operation: e.g, each of `len` and checksumming would be re-reading the **entire** file! Even `mmap` (at kernel level and only on some OSs) has a performance hit, userland would be _EEP_ bad.
Alex Martelli
The OS can do "`len`" without reading any file data. Scratch that. Only the checksum of the header is needed. Scratch that. By "make a file look like a string w/o reading it all in memory" I meant exactly that: it wouldn't read byte 1 into RAM until you look at it. Doing a `f.write(file_string)` would do a chunked copy (or whatever tarfile does right now). --- And yes, I know this isn't ever going to happen, as it would require breaking changes to the string API of all things. I'm not wrong, I'm just considering how Python would have been designed if it had been optimized for different goals.
BCS
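For reference, the closest existing thing to "a file that looks like a string" is the `mmap` module mentioned earlier in the thread. The sketch below assumes Python 3 and a throwaway temporary file; it only shows the interface and makes no claim about the performance cost debated above.

```python
import mmap
import os
import tempfile

# Create a small throwaway file to stand in for a large one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    path = tmp.name

try:
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS typically loads pages on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            total = len(view)         # size of the mapping, no explicit read() call
            head = view[:10]          # slicing returns bytes for just that range
            found = view.find(b"90")  # bytes-style search over the mapping
finally:
    os.remove(path)
```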