Is it possible to read the contents of a .ZIP file without fully downloading it?

I'm building a crawler and I'd rather not have to download every zip file just to index their contents.

Thanks;

+2  A: 

The format suggests that the key information about what's in the file resides at the end of it. Entries are then located by offsets recorded there, so you'll need access to the whole thing, I believe.

Gzip, on the other hand, can be read as a stream, I believe.

Anon
Yes, zip headers are at the end. You need the whole file _OR_ a downloader that lets you get specific parts.
Henk Holterman
GZip can be read as a stream, but all it is is a compressed stream. gzip doesn't have any type of container or multiple files, that's why `.tar.gz` is used: `.tar` combines files and `.gz` compresses them.
Sam
It's more like a footer, then.
Gary
Ah yes, I should not have implied gzip == zip. Good call.
Anon
+5  A: 

The tricky part is identifying the start of the central directory, which occurs at the end of the file. Since each entry is the same fixed size, you can do a kind of binary search starting from the end of the file. The binary search tries to guess how many entries are in the central directory. Start with some reasonable value, N, and retrieve the portion of the file at end-(N*sizeof(DirectoryEntry)). If that file position does not start with the central directory entry signature, then N is too large: halve it and repeat. Otherwise, N is too small: double it and repeat. As in any binary search, the process maintains the current upper and lower bounds. When the two become equal, you've found the value of N, the number of entries.

The number of times you hit the web server is at most 16, since there can be no more than 64K (2^16) entries.
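As a complementary sketch (a swapped-in alternative, not the binary search described above): the ZIP format ends with an end-of-central-directory record that stores the entry count along with the size and offset of the central directory, so those values can be read straight from the tail of the file. The Python below assumes a plain, single-disk archive without ZIP64 extensions, and assumes the bytes have already been fetched somehow. `parse_eocd` and `list_names` are hypothetical helper names. Because each central directory header carries a variable-length file name, the sketch walks the headers one by one instead of relying on a fixed entry size.

```python
import struct

EOCD_SIG = b"PK\x05\x06"      # end-of-central-directory signature
CDIR_SIG = b"PK\x01\x02"      # central directory file header signature
EOCD_MIN = 22                 # fixed part of the end-of-central-directory record
MAX_TAIL = EOCD_MIN + 0xFFFF  # record plus the longest possible archive comment


def parse_eocd(tail: bytes):
    """Return (entry_count, cdir_size, cdir_offset) read from the archive's tail.

    Assumption: no ZIP64, and the signature bytes do not also occur
    inside the trailing archive comment.
    """
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or len(tail) - pos < EOCD_MIN:
        raise ValueError("end-of-central-directory record not found")
    # signature, 2 disk numbers, 2 entry counts, size, offset, comment length
    _, _, _, _, total, size, offset, _ = struct.unpack_from("<4s4H2IH", tail, pos)
    return total, size, offset


def list_names(cdir: bytes, count: int):
    """Walk `count` central directory headers and collect the file names."""
    names, pos = [], 0
    for _ in range(count):
        if cdir[pos:pos + 4] != CDIR_SIG:
            raise ValueError("bad central directory header")
        # name, extra-field and comment lengths sit at byte 28 of each header
        name_len, extra_len, comment_len = struct.unpack_from("<3H", cdir, pos + 28)
        # Assumption: names are ASCII/UTF-8; legacy archives may use CP437.
        names.append(cdir[pos + 46:pos + 46 + name_len].decode("utf-8", "replace"))
        pos += 46 + name_len + extra_len + comment_len
    return names
```

Under these assumptions, fetching the last `MAX_TAIL` bytes and then the `cdir_size` bytes at `cdir_offset` would be enough to list the archive, and for small archives the tail alone may already contain the whole directory.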

Whether this is more efficient than downloading the whole file depends on the file size. You might request the size of the resource before downloading, and if it's smaller than a given threshold, download the entire resource. For large resources, requesting multiple offsets will be quicker, and overall less taxing for the webserver, if the threshold is set high.
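The size check could be sketched roughly as follows, using a `HEAD` request from Python's standard `urllib.request`. The helper names `probe` and `should_use_ranges` and the 1 MB default threshold are assumptions made for illustration; the right threshold depends on your crawler and the servers you hit.

```python
import urllib.request


def probe(url):
    """Send a HEAD request; return (size_or_None, Accept-Ranges value or None)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = resp.headers.get("Content-Length")
        size = int(length) if length is not None else None
        return size, resp.headers.get("Accept-Ranges")


def should_use_ranges(size, accept_ranges, threshold=1_000_000):
    """Decide whether partial requests look worthwhile for this resource.

    Conservative assumption: only use ranges when the server advertises
    'Accept-Ranges: bytes' and the size is known and at least `threshold`.
    """
    if size is None or accept_ranges != "bytes":
        return False
    return size >= threshold
```

A crawler might call `probe(url)` once per archive and fall back to a full download whenever `should_use_ranges` returns `False`. Some servers honor ranges without advertising them, so this check errs on the side of downloading the whole file.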

HTTP/1.1 allows ranges of a resource to be downloaded. For HTTP/1.0 you have no choice but to download the whole file.
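A minimal sketch of what such range requests might look like with Python's standard `urllib.request`. The function names `tail_request`, `range_request`, and `fetch` are hypothetical, and the code assumes the server answers a valid range with status 206 (Partial Content).

```python
import urllib.request


def tail_request(url, nbytes):
    """Build a request for the last `nbytes` bytes of the resource."""
    return urllib.request.Request(url, headers={"Range": f"bytes=-{nbytes}"})


def range_request(url, start, length):
    """Build a request for `length` bytes starting at offset `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive at both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch(req):
    """Send the request; fail if the server sent anything but a partial body."""
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:
            # A 200 here means the Range header was ignored and the
            # full resource is being sent instead.
            raise RuntimeError("server did not honor the Range header")
        return resp.read()
```

For example, `fetch(tail_request(url, 65557))` would ask for just the end of the archive, where the directory information lives.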

mdma
That's a really neat idea. I wasn't aware of HTTP/1.1 allowing ranges.
Earlz
@Earlz - HTTP/1.1 range requests are the backbone of download managers and interruptible/pausable downloads. See section 14.36, Range, in http://www.ietf.org/rfc/rfc2068.txt
mdma
A: 

I don't know if this helps, as I'm not a programmer, but in Outlook you can preview zip files and see the actual content, not just the file directory (if they are previewable documents like a PDF).

Joe Raby
A: 

There is a solution implemented in ArchView: "ArchView can open archive file online without downloading the whole archive." https://addons.mozilla.org/en-US/firefox/addon/5028/

Inside archview-0.7.1.xpi, in the file "archview.js", you can look at their JavaScript approach.

André Ricardo
Also, did you manage to solve this problem?
André Ricardo