ansaurus

Question

How do I read a huge .gz file (more than 5 GB uncompressed) in C?

Answer 1

+5 A:

gzip -cd compressed.gz | yourprogram

Just go ahead and read it line by line from stdin as it is decompressed.

EDIT: In response to your remarks about performance: you say that reading stdin line by line is slow compared to reading an uncompressed file directly. The difference comes down to buffering. Normally a pipe hands data to stdin as soon as the output becomes available (with little or no buffering there). You can do "buffered block reads" from stdin and parse the blocks yourself to gain performance.
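A minimal sketch of such a buffered block read, assuming the decompressed data arrives on stdin through the pipe. The name count_lines and the 64 KiB block size are illustrative choices, not something from the question, and counting newlines merely stands in for whatever parsing you actually need.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE (1 << 16) /* 64 KiB per fread call; an arbitrary choice */

/* Count newline characters in `in` by reading fixed-size blocks with fread
 * and scanning each block with memchr. Returns -1 on a read error. */
long long count_lines(FILE *in)
{
    static char block[BLOCK_SIZE];
    long long lines = 0;
    size_t got;

    while ((got = fread(block, 1, sizeof block, in)) > 0) {
        const char *p = block;
        const char *end = block + got;

        /* Jump from newline to newline inside the current block. */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;
        }
    }
    return ferror(in) ? -1 : lines;
}
```

Calling count_lines(stdin) from main and running the program as gzip -cd compressed.gz | ./yourprogram would exercise it on the decompressed stream.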

You can achieve the same result, possibly with better performance, by using gzread() as well: read a big chunk, parse the chunk, read the next chunk, and repeat.

ssg 2009-12-27 11:21:35

Right on the second line of his question he wrote that he can read the file line-by-line just fine.

Lukáš Lalinský 2009-12-27 11:25:20

Lukas, with the only exception that this solution doesn't require an existing "uncompressed file". It just decompresses on the fly.

ssg 2009-12-27 11:27:10

Ah, I'm sorry, I misread that. I thought he knew how to read the compressed file line by line.

Lukáš Lalinský 2009-12-27 11:29:27

How do I do a "buffered block read" from stdin?

monkeyking 2009-12-28 04:38:52

You can tell stdio to do that for you by using setvbuf(stdin, ...);
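A rough sketch of what that could look like. The helper name use_big_buffer and the 1 MiB size are assumptions made for illustration; setvbuf itself has to be called before any other operation on the stream, and the buffer must stay valid for as long as the stream is open.

```c
#include <stdio.h>

#define BIG_BUF_SIZE (1 << 20) /* 1 MiB; an assumed size, tune by measuring */

/* Give `in` a large, fully buffered stdio buffer. Call this before any other
 * operation on the stream, and for one stream only, because the buffer below
 * is shared. Returns 0 on success, nonzero on failure. */
int use_big_buffer(FILE *in)
{
    static char buf[BIG_BUF_SIZE]; /* static so it outlives this call */
    return setvbuf(in, buf, _IOFBF, sizeof buf);
}
```

With use_big_buffer(stdin) as the first statement in main, later fgets or fread calls on stdin are served from that buffer.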

ssg 2009-12-28 14:28:33

Thanks ssg! This will come in handy.

monkeyking 2009-12-28 15:27:06

Answer 2

+5 A:

gzread only reads chunks of the file; you loop on it as you would with a normal read() call.

Do you need to read the entire file into memory?

If what you need is to read lines, you'd gzread() a sizable chunk (say 8192 bytes) into a buffer, loop through that buffer, find all the '\n' characters, and process the pieces between them as individual lines. You'd have to save the last piece in case it is only part of a line, and prepend it to the data you read next time.

You could also read from stdin and invoke your app like this:

zcat bigfile.gz | ./yourprogram

in which case you can use fgets and similar functions on stdin. This also has the benefit that decompression runs on one processor while your program processes the data on another :-)

nos 2009-12-27 11:23:45

The POSIX read() on Linux is limited to about 2.1 GB per call, even on 64-bit platforms. I spent 3 days realizing this.
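For reference, the Linux read(2) man page documents that a single call transfers at most 0x7ffff000 bytes, so a large request has to be issued in a loop. A minimal sketch, where the helper name read_fully is hypothetical:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling read() until `want` bytes have arrived, end of file is
 * reached, or an error occurs. Returns the number of bytes stored in `dst`,
 * or -1 on error. `want` is assumed to fit in an ssize_t. */
ssize_t read_fully(int fd, void *dst, size_t want)
{
    char *p = dst;
    size_t done = 0;

    while (done < want) {
        ssize_t got = read(fd, p + done, want - done);
        if (got < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            return -1;
        }
        if (got == 0)
            break; /* end of file */
        done += (size_t)got;
    }
    return (ssize_t)done;
}
```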

monkeyking 2009-12-27 12:05:50

As nos says, you can simply do a streaming read of the compressed data. Your line processing reads from a buffered decompressor, which reads a chunk at a time. There's no need to read gigabytes at a time; that simply wastes memory.

gavinb 2009-12-27 13:09:48

Answer 3

A:

I don't know if this will be an answer to your question, but I believe it's more than a comment:

Some months ago I discovered that the contents of Wikipedia can be downloaded in much the same way as the StackOverflow data dump. Both decompress to XML.

I came across a description of how the multi-gigabyte compressed dump file could be parsed. It was done by Perl scripts, actually, but the relevant part for you was that Bzip2 compression was used.

Bzip2 is a block compression scheme, and the compressed file could be split into manageable pieces, and each part uncompressed individually.

Unfortunately, I don't have a link to share with you, and I can't suggest how you would search for it, except to say that it was described on a Wikipedia 'data dump' or 'blog' page.

EDIT: Actually, I do have a link

pavium 2009-12-27 11:27:18

Thanks. I guess bzip2 is a much better compression tool, but all my input files are .gz, and I can't change that.

monkeyking 2009-12-28 00:09:01

Okay, it was just a thought, anyway.

pavium 2009-12-28 01:16:23
