tags:
views: 119
answers: 5

I have two .gz files and I want to compare them without extracting them. For example:

The first file, number.txt.gz, contains:

1111,589,3698, 
2222,598,4589, 
3333,478,2695, 
4444,258,3694, 

The second file, xxx.txt.gz, contains:

1111,589,3698, 
2222,598,4589, 

I want to compare any column between those files. If column 1 of the first file is equal to column 1 of the second file, I want output like this:

1111,589,3698, 
2222,598,4589,
+2  A: 
Svisstack
gzip is a stream compressor, right? So in theory he could decompress both files in parallel (in memory only) and compare them line by line. It would still technically be decompression, but without creating a decompressed file. I assume the reason he doesn't want to decompress is the file size.
Blorgbeard
But I have files larger than 2 GB. Comparing such huge files takes too much time and space. Do you have any other suggestion for this?
gyrous
+1  A: 

You cannot usefully compare the files while they remain compressed: the same data compressed with different techniques (or settings) produces different bytes.

You must first decompress the files, and then find the difference between the results.

Decompression can be done with gunzip, tar, or uncompress; zcat can stream the decompressed data without writing it to disk.

Finding the difference can be done with the diff command.
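Putting the two steps together, a minimal sketch, assuming a shell such as bash that supports process substitution (the sample data here is made up to match the question, so the example is self-contained):

```shell
# Create two small gzipped sample files (use your real files instead).
printf '1111,589,3698,\n2222,598,4589,\n' | gzip > number.txt.gz
printf '1111,589,3698,\n2222,598,4589,\n' | gzip > xxx.txt.gz

# Stream-decompress both files and diff the results;
# no uncompressed copy is ever written to disk.
diff <(zcat number.txt.gz) <(zcat xxx.txt.gz) && echo "files match"
```

diff exits non-zero and prints the differing lines when the streams differ, so the same pattern works inside scripts.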

Oddthinking
Thanks, Oddthinking, but my problem is the file size.
gyrous
+1  A: 

If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.

My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.

Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.


EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.

As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
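The compressed-in/compressed-out pattern described above can be sketched even in the shell (the file names and the awk "processing" step here are hypothetical stand-ins, not part of the original answer):

```shell
# Self-contained demo input (replace with your real compressed file).
printf 'a,1\nb,2\n' | gzip > input.txt.gz

# Read compressed data, process it, and write compressed output;
# the uncompressed text only ever exists in the pipe, never on disk.
zcat input.txt.gz | awk -F',' '{print $1}' | gzip > output.txt.gz

zcat output.txt.gz
# prints:
# a
# b
```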

Carl Smotricz
+1  A: 

I'm not 100% sure whether you mean to match columns/fields or entire rows, but in the case of rows, something along these lines should work:

comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)

or if the shell doesn't support that, perhaps:

zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
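One caveat worth adding (not part of the original answer): comm assumes both inputs are sorted. If the files aren't already sorted, each stream can be sorted on the fly, at the cost of the time and temporary space that sort needs:

```shell
# Self-contained sample data (use your real files instead).
printf '3333,478,2695,\n1111,589,3698,\n' | gzip > number.txt.gz
printf '1111,589,3698,\n' | gzip > xxx.txt.gz

# Sort each decompressed stream on the fly, then print common lines.
comm -12 <(zcat number.txt.gz | sort) <(zcat xxx.txt.gz | sort)
# prints: 1111,589,3698,
```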
Adrian
Adrian, thanks a lot, it works fine. But if I have several columns, how can I compare them? For example, if I want to compare the 1st column of the 1st file with the 3rd column of the 2nd file? Kindly give your suggestion.
gyrous
It would be helpful if you could give a small example similar to your original question (or fix the question if it wasn't quite right).
Adrian
For example, file 1:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
file 2:
589,3698,1111
598,4589,2222
478,2695,3333
258,3694,4444
Comparing column 1 of the 1st file with column 3 of the 2nd file, I want output like this:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
or like this:
589,3698,1111
598,4589,2222
478,2695,3333
258,3694,4444
You can print lines from either input file.
gyrous
Perhaps this is closer to what you want? join -t',' -1 1 -2 3 <(zcat file1.txt.gz) <(zcat file2.txt.gz)
Adrian
Dear Adrian, the join command works only when the two files have equal numbers of rows and are sorted on the join field beforehand. My files have any number of columns and are 2 GB in size, so sorting is not possible. Maybe I should give you the command I use to compare two text files without sorting, where you can compare any field you want: nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file1.txt file2.txt — using the above command I can compare any fields between two files. Please look at that and give a suggestion for comparing gzip files the same way.
gyrous
If your data isn't sorted and you really don't have space to decompress or sort, I can't see a solution, unless you have a lot of RAM. Your awk command stores the whole of column 1 in memory, but if you want to try it: awk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(zcat file1.txt.gz) <(zcat file2.txt.gz)
Adrian
If one of your files is smaller than the other, swap them around so the smaller set gets stored in memory.
Adrian
Thanks, it works great! Adrian, may I know your email address and phone number? Which country are you from? You're really a genius.
gyrous
Dear Adrian, I have one more question. Can you help me? Please see my question.
gyrous
Adrian, I need your help.
gyrous
A: 

The exact answer I wanted is this:

nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)

Instead of awk, nawk works perfectly here, and since these are gzip files, use gzcat.
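For anyone reproducing this, a self-contained demo, assuming bash for the process substitution, plain awk (which runs this script the same way as nawk on most modern systems), and zcat in place of gzcat on Linux; the sample data is taken from the comments above:

```shell
printf '1111,589,3698,\n2222,598,4589,\n3333,478,2695,\n' | gzip > file1.txt.gz
printf '589,3698,1111\n598,4589,2222\n' | gzip > file2.txt.gz

# First pass (NR==FNR): store column 1 of file1 as array keys.
# Second pass: print file2 lines whose column 3 is among those keys.
awk -F',' 'NR==FNR {a[$1]; next} ($3 in a)' \
    <(zcat file1.txt.gz) <(zcat file2.txt.gz)
# prints:
# 589,3698,1111
# 598,4589,2222
```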

gyrous