Hello, I want to merge two bzip2-compressed files. I tried appending one to the other with `cat file1.bzip2 file2.bzip2 > out.bzip2`, which seems to work (the resulting file decompresses correctly), but when I use that file as a Hadoop input file I get errors about corrupted blocks.

What's the best way to merge 2 bzip2'ed files without decompressing them?
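To make the situation concrete, here is a minimal Python sketch of what `cat` produces: two complete bzip2 streams back to back. A multi-stream-aware reader returns all the data, while a reader that handles only one stream stops after the first. The sample payloads are made up for illustration, and whether this is exactly what goes wrong inside Hadoop here is an assumption, not something confirmed in this thread.

```python
import bz2

# Two independently compressed payloads (hypothetical sample data).
part1 = bz2.compress(b"first file\n")
part2 = bz2.compress(b"second file\n")

# Byte-for-byte what `cat file1.bzip2 file2.bzip2 > out.bzip2` would produce.
merged = part1 + part2

# bz2.decompress (Python 3.3+) reads every stream in the input.
all_data = bz2.decompress(merged)

# BZ2Decompressor handles a single stream and stops at its end.
decoder = bz2.BZ2Decompressor()
first_only = decoder.decompress(merged)
leftover = decoder.unused_data  # the second stream, still compressed

print(all_data)
print(first_only)
print(len(leftover) == len(part2))
```

The command-line `bzip2 -d` behaves like the first case, which would explain why the merged file looks fine when decompressed by hand.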

+1  A: 

You could compress (well, store) them both into a new bz2 file. You would then need three decompressions to get at the contents of the two archives, but it might work for your scenario.

Dave
This is a very nice idea, though it would be much better if bzip2 were smart enough that only one decompression was needed.
Wojtek
+1  A: 

Handling of concatenated bzip2 files is fixed on trunk, or should be: https://issues.apache.org/jira/browse/HADOOP-4012. There are examples of it working: https://issues.apache.org/jira/browse/MAPREDUCE-477?focusedCommentId=12871993&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12871993. Make sure you're running a recent version of Hadoop and you should be fine.

Jakob Homan
Bzipped files are split correctly, but I still can't figure out how to run a map task on concatenated files. (Decompressing all of them, concatenating the results with `cat`, and then compressing the resulting big input file does work.)
Wojtek
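
For reference, the workaround described in the comment above (decompress everything, concatenate, recompress) might look roughly like this in Python. This is only a sketch: `merge_bz2` is a hypothetical helper name, and it does decompress the inputs, so it does not answer the original "without decompressing" question. It streams the data, so the uncompressed content never has to fit in memory or be written to disk.

```python
import bz2
import shutil


def merge_bz2(inputs, output):
    """Decompress each input file and write all bytes into ONE bzip2 stream."""
    with bz2.open(output, "wb") as dst:
        for path in inputs:
            # bz2.open reads every stream in an input file, so inputs that
            # were themselves built with `cat` are handled as well.
            with bz2.open(path, "rb") as src:
                shutil.copyfileobj(src, dst)
```

Usage would be something like `merge_bz2(["file1.bzip2", "file2.bzip2"], "out.bzip2")`; the result is a single-stream file rather than a concatenation of streams.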