I have over a million text files compressed into 40 zip files. I also have a list of about 500 model names of phones. I want to find out the number of times a particular model was mentioned in the text files.

Is there any Python module that can do a regex match on the files without unzipping them? Is there a simple way to solve this problem without unzipping?

A: 

You could loop through the zip files, reading individual files with the zipfile module and running your regex on each one, which eliminates the need to unzip all the files at once.

I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.
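
A quick way to convince yourself of this is to compress some sample text and search the compressed bytes directly. This is only a rough sketch: the sample sentence and the model name are made up for illustration, and zlib stands in for the DEFLATE compression that zip archives normally use.

```python
import re
import zlib

# Hypothetical sample text; the phone model name is only an illustration.
text = ("The Nokia N95 was compared with the Nokia N95 8GB. " * 50).encode("utf-8")
pattern = re.compile(rb"Nokia N95")

# zlib produces a DEFLATE stream, the method zip archives normally use.
compressed = zlib.compress(text)

# Search the compressed bytes, then the decompressed bytes, with the same regex.
hits_in_compressed = len(pattern.findall(compressed))
hits_in_original = len(pattern.findall(zlib.decompress(compressed)))

print("sizes:", len(text), "->", len(compressed))
print("matches in compressed bytes:", hits_in_compressed)
print("matches after decompressing:", hits_in_original)
```

The literal text is bit-packed and back-referenced in the compressed stream, so the regex should only find its matches after decompression.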

jeremiahd
A: 

To access the contents of a zip file you have to decompress it, but the zipfile module makes this fairly easy, as you can unzip each file within an archive individually.

Python zipfile module
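
For the counting problem in the question, something along these lines might work. It is a minimal sketch, not a tuned solution: the function name `count_mentions`, the UTF-8 encoding, and the case-insensitive substring matching are all assumptions, so adjust them to your data.

```python
import re
import zipfile
from collections import Counter


def count_mentions(zip_paths, models):
    """Count case-insensitive mentions of each model across all archives."""
    # Longest names first, so a longer model name wins over its own prefix.
    ordered = sorted(models, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in ordered), re.IGNORECASE)
    # Map lowercased matches back to the spelling used in the model list.
    canonical = {m.lower(): m for m in models}

    counts = Counter()
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Decompressed in memory only; nothing is written to disk.
                # UTF-8 is an assumption about the files' encoding.
                text = archive.read(info).decode("utf-8", errors="replace")
                for match in pattern.finditer(text):
                    key = match.group(0).lower()
                    counts[canonical.get(key, key)] += 1
    return counts
```

You would call it as `count_mentions(list_of_zip_paths, list_of_model_names)`. One combined pattern means each file is scanned once rather than 500 times. Note that the matches do not overlap and there are no word boundaries, so a model name that is contained in a longer one is not counted separately at that position.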

Craig.Nicol
+7  A: 

There's nothing that will automatically do what you want.

However, there is a Python zipfile module that will make this easy to do. Here's how to iterate over the lines of each file in the archive.

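#!/usr/bin/env python3

import zipfile

# Members are decompressed in memory; nothing is extracted to disk.
with zipfile.ZipFile('myfile.zip') as f:
    for subfile in f.namelist():
        print(subfile)
        # read() returns bytes; UTF-8 is an assumption about the encoding.
        data = f.read(subfile).decode('utf-8', errors='replace')
        for line in data.splitlines():
            print(line)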
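
If the individual files are large, a variation is to stream each member instead of reading it whole. The sketch below assumes UTF-8 text, and the helper name `iter_lines` is made up for illustration; `ZipFile.open` and `io.TextIOWrapper` are standard library calls.

```python
import io
import zipfile


def iter_lines(zip_path, encoding="utf-8"):
    """Yield (member_name, line) pairs, decompressing each member as a stream."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as raw:
                # TextIOWrapper decodes the binary stream lazily, line by line.
                wrapper = io.TextIOWrapper(raw, encoding=encoding, errors="replace")
                for line in wrapper:
                    yield info.filename, line.rstrip("\n")
```

Usage would look like `for name, line in iter_lines('myfile.zip'): ...`, running the regex on each line as it arrives.
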
Mark Harrison
A: 
Chris Conway
Absolutely not. Zip files are not stored with plain Huffman coding: the usual DEFLATE method combines dictionary-based (LZ77) coding with Huffman coding. The encoding of symbols depends on the encodings of other symbols and their frequency in the source. So this technique has no chance to work, whatsoever.
Eli Bendersky
eliben, I see no proof of impossibility in your comment, whatsoever. Perhaps this margin is too narrow to contain it?
Chris Conway