Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index their contents.
Thanks;
Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index their contents.
Thanks;
the format suggests that the key piece of information about what's in the file resides at the end of it. Entries are then specified as an offset from that particular entry, so you'll need to have access to the whole thing I believe.
GZip formats are able to be read as a stream I believe.
The tricky part is in identifying the start of the central directory, which occurs at the end of the file. Since each entry is the same fixed size, you can do a kind of binary search starting from the end of the file. The binary search is trying to guess how many entries are in the central directory. Start with some reasonable value, N, and retrieve that portion of the file at end-(N*sizeof(DirectoryEntry)). If that file position does not start with the central directory entry signature, then N is too large - half and repeat, otherwise, N is too small, double and repeat. Like binary search, the process maintains the current upper and lower bound. When the two become equal, you've found the value for N, the number of entries.
The number of times you hit the webserver, is at most 16, since there can be no more than 64K entries.
Whether this is more efficient than downloading the whole file depends on the file size. You might request the size of the resource before downloading, and if it's smaller than a given threshold, download the entire resource. For large resources, requesting multiple offsets will be quicker, and overall less taxing for the webserver, if the threshold is set high.
HTTP/1.1 allows ranges of a resource to be downloaded. For HTTP/1.0 you have no choice but to download the whole file.
I don't know if this helps, as I'm not a programmer. But in Outlook you can preview zip files and see the actual content, not just the file directory (if they are previewable documents like a pdf).
There is a solution implemented in ArchView "ArchView can open archive file online without downloading the whole archive." https://addons.mozilla.org/en-US/firefox/addon/5028/
Inside the archview-0.7.1.xpi in the file "archview.js" you can look at their javascript approach.