As a primarily Windows developer, perhaps I'm missing something cultural in the Linux community, but it has always confused me that downloads have their files first put into a .tar archive, then zipped. Why the two-step process? Doesn't zipping achieve the file grouping? Is there some other benefit that I'm not aware of?

+80  A: 

bzip2 and gzip work on single files, not groups of files. Plain old zip (and pkzip) operates on groups of files and has the concept of the archive built in.

The *nix philosophy is one of small tools that do specific jobs very well and can be chained together. That's why there are two tools here with specific tasks, designed to fit well together. It also means you can use tar to group files and then choose your compression tool (bzip2, gzip, etc.).
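For illustration, the two steps might look like this (the project/ directory name is just an example):

tar -cf project.tar project/     # step 1: group the files into one archive
gzip project.tar                 # step 2: compress it, producing project.tar.gz

Swap bzip2 in for gzip and only the second step changes.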

Stewart Johnson
Answered faster, and a little better than I did - guess I'll go back to work!
Harper Shelby
I beat you by 4 minutes, and that's an eternity in stack overflow time. :-)
Stewart Johnson
It's worth noting that both tar and gzip are useful on their own which is why they're separated. With some clever use of pipes, I once moved a folder from one computer to another by tarring, zipping and piping over SSH, then unzipping and untarring on the destination. One command, no temp files.
rmeador
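A sketch of that kind of one-command transfer (host and paths are hypothetical):

tar -cf - myfolder | gzip | ssh user@otherhost 'gunzip | tar -xf - -C /destination'

tar writes the archive to stdout, gzip compresses the stream, and the remote side decompresses and unpacks it, with no temporary file anywhere.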
So basically you are saying that *nix people inflict tarballs on the world because of a stubborn refusal to adjust their behaviour to use modern tools and techniques?
David Arno
You could also say that they stick with what has been proven to work well, rather than changing things and breaking compatibility. A .tar.gz can be created in a single step anyway, just like with the so-called modern tools and techniques (Please Register WinZip).
JeeBee
@David please clarify "inflict". It's only an issue if you're using out-of-date tools yourself. Most modern tools will treat zips and tarballs the same way. gzip compression is pretty much the same as zip's; bzip2 is generally smaller.
Steve g
@Steve g, looks like I'm hoisted by my own petard there :)
David Arno
GNU tar has built-in support for gzip and bzip2, so you don't have to call them yourself: using the -z or -j command line options, the lazy user can build and unpack a tar.gz / tar.bz2 on the fly, with no need to set up a pipe.
ypnos
create bzip: tar -jcpvf foo.tar.bz2 <dir>
unpack bzip: tar -jxvf foo.tar.bz2
create gz: tar -zcpvf foo.tar.gz <dir>
unpack gz: tar -zxvf foo.tar.gz
dicroce
tar is a format for combining a group of files into a single file. There is no reason why this should require it to be compressed. A modern replacement for tar would probably be ISO9660 rather than zip. That you can optionally compress it with any algorithm you choose is one of the benefits of the Unix design.
Martin Beckett
And let's not forget that the tar+gzip approach is both computationally cheaper and achieves better compression than WinZip. (Although I'll admit to not actually benchmarking that...)
Arafangion
+1  A: 

Tar = groups files into one file

GZip = compresses that file

They split the process in two. That's it.

In the Windows environment you are probably more used to WinZip or WinRAR, which produce a Zip. The Zip process in those programs both groups the files and compresses them, but you simply don't see the two steps.

Daok
It's not the best explanation, given that the "zip" files the OP is used to in Windows, already incorporate the grouping.
Gareth
+2  A: 

gzip and bzip2 are simply compressors, not archivers. Hence the combination: you need the tar software to bundle all the files.

ZIP itself, and RAR as well, are a combination of the two processes.

jishi
+3  A: 

Usually in the *nux world, bundles of files are distributed as tarballs and then optionally gzipped. Gzip is a simple file compression program that doesn't do the file bundling that tar or zip does.

At one time, zip didn't properly handle some of the things that Unix tar and unix file systems considered normal, like symlinks, mixed case files, etc. I don't know if that's changed, but that's why we use tar.

Paul Tomblin
*nux - Linux, Unux, Solarnux?
mackenir
@mackenir - don't forget POSUX. :-)
Paul Tomblin
@mackenir - Or should that be POSNUX?
Paul Tomblin
+2  A: 

In the Unix world, most applications are designed to do one thing, and do it well. The most popular compression utilities in Unix, gzip and bzip2, only do file compression. tar does the file concatenation. Piping the output of tar into a compression utility gives you what's needed, without adding excessive complexity to either piece of software.
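For example, the pipe might look like this (directory name hypothetical):

tar -cf - project/ | gzip > project.tar.gz      # tar writes to stdout, gzip reads from stdin
tar -cf - project/ | bzip2 > project.tar.bz2    # same grouping step, different compressor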

Harper Shelby
+4  A: 

I think you were looking for more of the historical context here. The original compression utilities worked on a single file, while tar is used to place multiple files into a single file. Hence tarring and zipping is a two-step process. Why it is still so dominant today is anyone's guess.

From the Wikipedia article on tar (file format):

In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of the program used to handle such files. The format was standardized by POSIX.1-1988 and later POSIX.1-2001. Initially developed as a raw format, used for tape backup and other sequential access devices for backup purposes, it is now commonly used to collate collections of files into one larger file, for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

martinatime
+15  A: 

It's odd that no-one else has mentioned that modern versions of GNU tar allow you to compress as you are bundling:

tar -czf output.tar.gz directory1 ...

tar -cjf output.tar.bz2 directory2 ...

You can also use the compressor of your choosing provided it supports the '-c' (to stdout, or from stdin) and '-d' (decompress) options:

tar -cf output.tar.xxx --use-compress-program=xxx directory1 ...

This would allow you to specify any alternative compressor.

[Added: If you are extracting from gzip or bzip2 compressed files, GNU tar auto-detects these and runs the appropriate program. That is, you can use:

tar -xf output.tar.gz
tar -xf output.tgz        # A synonym for the .tar.gz extension
tar -xf output.tar.bz2

and these will be handled properly. If you use a non-standard compressor, then you need to specify that when you do the extraction.]

The reason for the separation is, as in the selected answer, the separation of duties. Amongst other things, it means that people could use the 'cpio' program for packaging the files (instead of tar) and then use the compressor of choice. Once upon a time, the preferred compressor was pack; later it was compress (which was much more effective than pack); then came gzip, which ran rings around both its predecessors and is entirely competitive with zip (which has been ported to Unix, but is not native there); and now there's bzip2, which, in my experience, usually has a 10-20% advantage over gzip.

[Added: someone noted in their answer that cpio has funny conventions. That's true, but until GNU tar got the relevant options ('-T -'), cpio was the better command when you did not want to archive everything that was underneath a given directory -- you could actually choose exactly which files were archived. The downside of cpio was that you not only could choose the files -- you had to choose them. There's still one place where cpio scores; it can do an in-situ copy from one directory hierarchy to another without any intermediate storage:

cd /old/location; find . -depth -print | cpio -pvdumB /new/place

Incidentally, the '-depth' option on find is important in this context: it makes find list the contents of a directory before the directory itself, so cpio only sets the permissions on a directory after writing its contents. When I checked the command before adding it to this answer, I copied some read-only directories (555 permission); when I went to delete the copy, I had to relax the permissions on the directories before 'rm -fr /new/place' could finish. Without the -depth option, the cpio command would have failed. I only re-remembered this when I went to do the cleanup; the formula quoted is that automatic to me (mainly by virtue of many repetitions over many years). ]

Jonathan Leffler
An expanded ZIP format could accommodate plug-in stream compressors just as much as TAR, but without a suitable IPC protocol it would likely be slower due to excessive exec'ing of subprocesses.
Barry Kelly
my go-to command: tar -xvzf tarpkg.tar.gz (replace the z with a j for bz2 compressed archives)
Redbeard 0x0A
@Redbeard: tar auto-detects gzip and bzip2 on extract -- for create, you have to tell it what to do, but I just use -xf (or -xvf) and the tar file name.
Jonathan Leffler
+11  A: 

An important distinction is in the nature of the two kinds of archives.

TAR files are little more than a concatenation of the file contents with some headers, while gzip and bzip2 are stream compressors that, in tarballs, are applied to the whole concatenation.

ZIP files are a concatenation of individually compressed files, with some headers. Actually, the DEFLATE algorithm is used by both zip and gzip, and with appropriate binary adjusting, you could take the payload of a gzip stream and put it in a zip file with appropriate header and directory entries.

This means that the two different archive types have different trade-offs. For large collections of small files, TAR followed by a stream compressor will normally result in higher compression ratio than ZIP because the stream compressor will have more data to build its dictionary frequencies from, and thus be able to squeeze out more redundant information. On the other hand, a (file-length-preserving) error in a ZIP file will only corrupt those files whose compressed data was affected. Normally, stream compressors cannot meaningfully recover from errors mid-stream. Thus, ZIP files are more resilient to corruption, as part of the archive will still be accessible.
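A rough way to see this trade-off for yourself (directory and file names hypothetical; the exact numbers depend on the data):

tar -cf - manysmallfiles/ | gzip -9 > files.tar.gz    # one stream, statistics shared across all files
zip -9 -r files.zip manysmallfiles/                   # each file compressed independently
ls -l files.tar.gz files.zip                          # the tar.gz is usually smaller for many small files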

Barry Kelly
+3  A: 

Another reason it is so prevalent is that tar and gzip are on almost the entire *NIX install base out there. I believe this is probably the single largest reason. It is also why zip files are extremely prevalent on Windows: support is built in, despite the superior compression of RAR or 7z.

GNU tar also allows you to create/extract these files in one command (one step):

  • Create an archive:

    tar -cvjf destination.tar.bz2 *.files
    tar -cvzf destination.tar.gz *.files

  • Extract an archive (the -C part is optional; it defaults to the current directory):

    tar -xvjf archive.tar.bz2 -C destination_path
    tar -xvzf archive.tar.gz -C destination_path

These are what I have committed to memory from my many years on Linux and recently on Nexenta (OpenSolaris).

Redbeard 0x0A
Actually, it's the other way round: zip is built into Windows *now* because it was prevalent in DOS and early versions of Windows.
Christian Lescuyer
I like to use RAR on Windows, tar.bz2 on Linux
Osama ALASSIRY
You might note the weird (not compatible with getopt()) option parsing, and the dash is optional because tar pre-dates the standard conventions of Unix command options.
Jonathan Leffler
A: 

tar is popular mostly for historic reasons. There are several alternatives readily available. Some of them are around for nearly as long as tar, but couldn't surpass tar in popularity for several reasons.

  • cpio (alien syntax; theoretically more consistent, but people like what they know, so tar prevailed)
  • ar (popular a long time ago, now used for packing library files)
  • shar (self-extracting shell scripts; had all sorts of issues; used to be popular nevertheless)
  • zip (because of licensing issues it wasn't readily available on many Unices)

A major advantage (and downside) of tar is that it has neither a global file header nor a central directory of contents. For many years it therefore never suffered from limitations in file size (until this decade, when an 8 GB limit on individual files inside the archive became a problem; it was solved years ago).

Apparently the one downside of tar.gz (or tar.Z, for that matter), which is that you have to uncompress the whole archive to extract single files or to list the archive contents, never hurt people enough to make them defect from tar in significant numbers.
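To illustrate (archive and file names hypothetical): extracting one file from a tarball means decompressing the stream from the beginning, whereas zip can seek straight to the entry via its central directory:

tar -xzf big.tar.gz path/inside/file.txt    # decompresses the stream from the start
unzip big.zip path/inside/file.txt          # jumps directly to the compressed entry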

edgar.holleis
Old versions of tar did not have header information; new (POSIX-compatible, USTAR) versions of tar (e.g. GNU tar) do. The 'file' command understands this.
Jonathan Leffler
A: 

Tar is not only a file format; it was also a tape format. Tapes store data sequentially, and each vendor's storage implementation was custom. Tar was the method by which you could take data off a disk and store it onto tape in a way that other people could retrieve it without your custom program.

Later, the compression programs came along, and *nix still had only one method of creating a single file that contained multiple files.

I believe it's just inertia that has continued the tar.gz trend. Pkzip did both compression and archival in one fell swoop, but then DOS systems didn't typically have tape drives attached!

From the Wikipedia article on tar (file format):

In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of the program used to handle such files. The format was standardized by POSIX.1-1988 and later POSIX.1-2001. Initially developed as a raw format, used for tape backup and other sequential access devices for backup purposes, it is now commonly used to collate collections of files into one larger file, for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

Kieveli
Strictly speaking tar is a file format - it's just that on unix tapes are just another file.
Martin Beckett
Actually, tar was not a file format to begin with. Tapes did not have file systems, so tar was created as a patch for not having a file system.
Kieveli
+5  A: 

The funny thing is, you can get behaviour not anticipated by the creators of tar and gzip. For example, you can not only gzip a tar file, you can also tar gzipped files to produce a files.gz.tar (which would technically be closer to the way pkzip works). Or you can put another program into the pipeline, for example some cryptography, and choose an arbitrary order of tarring, gzipping and encrypting. Whoever wrote the cryptography program does not need to have the slightest idea how it will be used; all it needs to do is read from standard input and write to standard output.
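For instance, a sketch of such a pipeline (names are hypothetical, and OpenSSL stands in for whatever cryptography program you prefer):

tar -cf - project/ | gzip | openssl enc -aes-256-cbc -salt > project.tar.gz.enc    # tar, compress, then encrypt
openssl enc -d -aes-256-cbc < project.tar.gz.enc | gunzip | tar -xf -              # reverse the pipeline to unpack

None of the three programs knows anything about the other two; each just reads standard input and writes standard output.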

Svante
...and this is the true strength of the unix philosophy.
Dave Sherohman
+1  A: 

For the same reason that Mac users love disk images: they are a really convenient way to archive stuff and then pass it around, upload/download it, email it, etc.

And they're easier to use and more portable than zips, IMHO.

Tobias
A: 

In my Altos-XENIX days (1982) we started using tar (tape archiver) to extract files from 5 1/4" floppies or streaming tape, as well as to copy to these media. Its functionality is very similar to the supplemental BACKUP.EXE and RESTORE.EXE commands in DOS 5.0 and 6.22, allowing you to span multiple media if the data couldn't fit on just one. The drawback was that if one of the multiple media had problems, the whole thing was worthless. tar and dd originate from UNIX System III and have remained standard release utilities with UNIX-like OSes, probably for backward-compatibility reasons.

Frank Computer