tags:
views: 119
answers: 5

I have two .gz files and I want to compare them without extracting them. For example:

The first file, number.txt.gz, contains:

1111,589,3698, 
2222,598,4589, 
3333,478,2695, 
4444,258,3694, 

The second file, xxx.txt.gz, contains:

1111,589,3698, 
2222,598,4589, 

I want to compare any column between those files. If column 1 of the first file is equal to column 1 of the second file, I want output like this:

1111,589,3698, 
2222,598,4589,
+2  A: 
Svisstack
gzip is a stream compressor, right? So in theory he could decompress both files in parallel (in memory only) and compare them line by line. It would still technically be decompression, but without creating a decompressed file. I assume the reason he doesn't want to decompress is the file size.
Blorgbeard
But I have files larger than 2 GB. Comparing such huge files takes too much time and space. Do you have any other suggestion for this?
gyrous
+1  A: 

You cannot usefully compare the files while they remain compressed: the same data compressed with different techniques (or settings) produces different bytes.

You must first decompress the files, and then find the difference between the results.

Decompression can be done with gunzip, tar, or uncompress; zcat can stream the decompressed data without writing it to disk.

Finding the difference can be done with the diff command.
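Putting the two steps together, a minimal sketch, assuming a shell such as bash that supports process substitution (the sample data here is made up to match the question, so the example is self-contained):

```shell
# Create two small gzipped sample files (use your real files instead).
printf '1111,589,3698,\n2222,598,4589,\n' | gzip > number.txt.gz
printf '1111,589,3698,\n2222,598,4589,\n' | gzip > xxx.txt.gz

# Stream-decompress both files and diff the results;
# no uncompressed copy is ever written to disk.
diff <(zcat number.txt.gz) <(zcat xxx.txt.gz) && echo "files match"
```

diff exits non-zero and prints the differing lines when the streams differ, so the same pattern works inside scripts.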

Oddthinking
Thanks, Oddthinking, but my problem is the file size.
gyrous
+1  A: 

If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.

My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.

Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.


EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.

As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
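The compressed-in/compressed-out pattern described above can be sketched even in the shell (the file names and the awk "processing" step here are hypothetical stand-ins, not part of the original answer):

```shell
# Self-contained demo input (replace with your real compressed file).
printf 'a,1\nb,2\n' | gzip > input.txt.gz

# Read compressed data, process it, and write compressed output;
# the uncompressed text only ever exists in the pipe, never on disk.
zcat input.txt.gz | awk -F',' '{print $1}' | gzip > output.txt.gz

zcat output.txt.gz
# prints:
# a
# b
```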

Carl Smotricz
+1  A: 

I'm not 100% sure whether you mean to match columns/fields or entire rows, but in the case of rows, something along these lines should work:

comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)

or if the shell doesn't support that, perhaps:

zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
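One caveat worth adding (not part of the original answer): comm assumes both inputs are sorted. If the files aren't already sorted, each stream can be sorted on the fly, at the cost of the time and temporary space that sort needs:

```shell
# Self-contained sample data (use your real files instead).
printf '3333,478,2695,\n1111,589,3698,\n' | gzip > number.txt.gz
printf '1111,589,3698,\n' | gzip > xxx.txt.gz

# Sort each decompressed stream on the fly, then print common lines.
comm -12 <(zcat number.txt.gz | sort) <(zcat xxx.txt.gz | sort)
# prints: 1111,589,3698,
```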
Adrian
Adrian, thanks a lot, it works fine. But if I have several columns, how can I compare them? For example, if I want to compare the 1st column of the 1st file with the 3rd column of the 2nd file? Kindly give your suggestion.
gyrous
It would be helpful if you could give a small example similar to your original question (or fix the question if it wasn't quite right).
Adrian
For example, file 1:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
file 2:
589,3698,1111
598,4589,2222
478,2695,3333
258,3694,4444
Comparing column 1 of the 1st file with column 3 of the 2nd file, I want output like this:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
or like this:
589,3698,1111
598,4589,2222
478,2695,3333
258,3694,4444
You can print lines from either input file.
gyrous
Perhaps this is closer to what you want? join -t',' -1 1 -2 3 <(zcat file1.txt.gz) <(zcat file2.txt.gz)
Adrian
Dear Adrian, the join command works only when the two files have equal numbers of rows and are sorted on the join field beforehand. My files have any number of columns and are 2 GB in size, so sorting is not possible. Maybe I should give you the command I use to compare two text files without sorting, where you can compare any field you want: nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file1.txt file2.txt — using the above command I can compare any fields between two files. Please look at that and give a suggestion for comparing gzip files the same way.
gyrous
If your data isn't sorted and you really don't have space to decompress or sort, I can't see a solution, unless you have a lot of RAM. Your awk command stores the whole of column 1 in memory, but if you want to try it: awk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(zcat file1.txt.gz) <(zcat file2.txt.gz)
Adrian
If one of your files is smaller than the other, swap them around so the smaller set gets stored in memory.
Adrian
Thanks, it works great! Adrian, may I know your email address and phone number? Which country are you from? You're really a genius.
gyrous
Dear Adrian, I have one more question. Can you help me? Please see my question.
gyrous
Adrian, I need your help.
gyrous
A: 

The exact answer I wanted is this:

nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)

Instead of awk, nawk works perfectly here, and since these are gzip files, use gzcat.
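For anyone reproducing this, a self-contained demo, assuming bash for the process substitution, plain awk (which runs this script the same way as nawk on most modern systems), and zcat in place of gzcat on Linux; the sample data is taken from the comments above:

```shell
printf '1111,589,3698,\n2222,598,4589,\n3333,478,2695,\n' | gzip > file1.txt.gz
printf '589,3698,1111\n598,4589,2222\n' | gzip > file2.txt.gz

# First pass (NR==FNR): store column 1 of file1 as array keys.
# Second pass: print file2 lines whose column 3 is among those keys.
awk -F',' 'NR==FNR {a[$1]; next} ($3 in a)' \
    <(zcat file1.txt.gz) <(zcat file2.txt.gz)
# prints:
# 589,3698,1111
# 598,4589,2222
```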

gyrous