Hi. I'm on a shared server with restricted disk space and I've got a gz file that expands into a HUGE file, more than the space I've got. How can I extract it "portion" by "portion" (let's say 10 MB at a time), and process each portion, without extracting the whole thing even temporarily!

No, this is just ONE super-huge compressed file, not a set of files, please...


Hi David, your solution looks quite elegant, but if I'm reading it right, it seems like every time gunzip extracts from the beginning of the file (and the output of that is thrown away). I'm sure that'll put a huge strain on the shared server I'm on (I don't think it's "reading ahead" at all) - do you have any insights on how I can make gunzip "skip" the necessary number of blocks?

A: 

Unfortunately I don't know of an existing Unix command that does exactly what you need. You could do it easily with a little program, e.g. in Python, cutter.py (any other language would do just as well, of course):

import sys

try:
    size = int(sys.argv[1])
    N = int(sys.argv[2])
except (IndexError, ValueError):
    print("Use: %s size N" % sys.argv[0], file=sys.stderr)
    sys.exit(2)
# stdin is a pipe, so it can't be seeked: read and discard the first
# (N-1)*size bytes in 1 MiB pieces, then copy the next size bytes out.
to_skip = (N - 1) * size
while to_skip > 0:
    chunk = sys.stdin.buffer.read(min(to_skip, 1 << 20))
    if not chunk:
        break
    to_skip -= len(chunk)
sys.stdout.buffer.write(sys.stdin.buffer.read(size))

Now gunzip <huge.gz | python cutter.py 1000000 5 > fifthone will put exactly a million bytes into the file fifthone, skipping the first 4 million bytes of the uncompressed stream.
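
If you want to work through every portion in turn, a small shell loop around cutter.py would do it. This is only a sketch: chunk and process are placeholders for your own temporary file and processing step, and note that each pass decompresses the stream from the beginning again.

N=1
while :; do
    gunzip <huge.gz | python cutter.py 10485760 $N > chunk   # 10 MB per portion
    [ -s chunk ] || break    # an empty chunk means we ran off the end
    process chunk            # placeholder for whatever you do with each portion
    N=$((N + 1))
done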

Alex Martelli
+3  A: 

If you're doing this with (Unix/Linux) shell tools, you can use gunzip -c to uncompress to stdout, then use dd with the skip and count options to copy only one chunk.

For example:

gunzip -c input.gz | dd bs=10485760 skip=0 count=1 >output

then skip=1, skip=2, etc.
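
If decompressing from the start for every chunk turns out to be too heavy for a shared server, one variant is to decompress only once and let dd carve successive 10 MB pieces off the same pipe. A rough sketch, assuming GNU dd (iflag=fullblock keeps dd from stopping at a short read on a pipe); chunk and process are again placeholders:

gunzip -c input.gz | (
    while dd of=chunk bs=10485760 count=1 iflag=fullblock 2>/dev/null \
          && [ -s chunk ]; do
        process chunk    # placeholder for your per-portion step
    done
)

Each portion is processed as soon as dd hands it over, so at most about 10 MB of uncompressed data sits on disk at any time; the loop ends when dd produces an empty chunk.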

David Gelhar
Excellent insight, David - it's not exactly what I was looking for, but I'll accept it nevertheless...
Dave