I'm parsing huge globs of text files and was testing which input method to use.

There is not much of a difference between C++ std::ifstream and C FILE.

According to the zlib documentation, it supports uncompressed files and will read them without decompression.

However, the run time goes from 12 seconds without zlib to more than 4 minutes with zlib.h.

I've tested this over multiple runs, so it's not a disk cache issue.

Am I using zlib in some wrong way?

Thanks

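#include <zlib.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#define LENS 1000000

// Count lines with C stdio (fgets).
size_t fg(const char *fname){
  fprintf(stderr,"\t-> using fgets\n");
  FILE *fp = fopen(fname,"r");
  if(fp==NULL){
    perror(fname);
    return 0;
  }
  size_t nLines = 0;
  char *buffer = new char[LENS];
  while(NULL!=fgets(buffer,LENS,fp))
    nLines++;

  fprintf(stderr,"%zu\n",nLines);
  delete [] buffer;
  fclose(fp);
  return nLines;
}

// Count lines with C++ streams (std::ifstream::getline).
size_t is(const char *fname){
  fprintf(stderr,"\t-> using ifstream\n");
  std::ifstream is(fname,std::ios::in);
  if(!is){
    fprintf(stderr,"could not open %s\n",fname);
    return 0;
  }
  size_t nLines = 0;
  char *buffer = new char[LENS];
  while(is.getline(buffer,LENS))
    nLines++;

  fprintf(stderr,"%zu\n",nLines);
  delete [] buffer;
  return nLines;
}

// Count lines with zlib (gzgets), which also reads uncompressed files.
size_t iz(const char *fname){
  fprintf(stderr,"\t-> using zlib\n");
  gzFile fp = gzopen(fname,"r");
  if(fp==NULL){
    fprintf(stderr,"could not open %s\n",fname);
    return 0;
  }
  size_t nLines = 0;
  char *buffer = new char[LENS];
  while(NULL!=gzgets(fp,buffer,LENS))
    nLines++;

  fprintf(stderr,"%zu\n",nLines);
  delete [] buffer;
  gzclose(fp);
  return nLines;
}

int main(int argc,char**argv){
  if(argc<3){
    fprintf(stderr,"usage: %s <file> <0=fgets|1=ifstream|2=zlib>\n",argv[0]);
    return 1;
  }
  int method = atoi(argv[2]);
  if(method==0)
    fg(argv[1]);
  else if(method==1)
    is(argv[1]);
  else if(method==2)
    iz(argv[1]);

  return 0;
}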
+1  A: 

I guess you are using zlib-1.2.3. In that version, gzgets() effectively calls gzread() once per byte, and calling gzread() that way carries a large overhead. You can compare the CPU time of calling gzread(gzfp, buffer, 4096) once against calling gzread(gzfp, buffer, 1) 4096 times. The result is the same, but the CPU time is hugely different.

What you should do is implement buffered I/O on top of zlib, reading ~4 KB of data per gzread() call (like what fread() does for read()). The latest zlib-1.2.5 is said to have significantly improved gzread/gzgetc/.... You may want to try that as well. As it was released very recently, I have not tried it personally.
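A minimal sketch of such a buffering layer might look like the following. `BufferedLineReader` and its `read_chunk` callable are hypothetical names introduced for illustration. The class only assumes a function that fills a buffer and returns the number of bytes read, which is how gzread behaves (0 at end of file, -1 on error).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Hypothetical line reader that refills its buffer one chunk at a time.
// `read_chunk(buf, len)` returns the number of bytes read, or <= 0 at end
// of input or on error. With zlib this could be, for an open gzFile fp:
//   [fp](char *b, unsigned n) { return gzread(fp, b, n); }
class BufferedLineReader {
 public:
  explicit BufferedLineReader(std::function<int(char *, unsigned)> read_chunk,
                              unsigned chunk_size = 4096)
      : read_chunk_(read_chunk), buffer_(chunk_size), pos_(0), end_(0), eof_(false) {
    assert(chunk_size > 0);
  }

  // Stores the next line, without its '\n', in `line`.
  // Returns false once the input is exhausted.
  bool getline(std::string &line) {
    line.clear();
    bool got_any = false;
    for (;;) {
      if (pos_ == end_) {
        // Buffer is empty: fetch the next chunk with a single call.
        if (eof_ || !refill()) return got_any;
      }
      got_any = true;
      const char *start = buffer_.data() + pos_;
      const void *nl = std::memchr(start, '\n', end_ - pos_);
      if (nl != nullptr) {
        const std::size_t len = static_cast<const char *>(nl) - start;
        line.append(start, len);
        pos_ += len + 1;  // skip past the newline
        return true;
      }
      // No newline in the rest of this chunk: keep it and read on.
      line.append(start, end_ - pos_);
      pos_ = end_;
    }
  }

 private:
  bool refill() {
    const int got = read_chunk_(buffer_.data(), static_cast<unsigned>(buffer_.size()));
    if (got <= 0) {
      eof_ = true;
      return false;
    }
    pos_ = 0;
    end_ = static_cast<std::size_t>(got);
    return true;
  }

  std::function<int(char *, unsigned)> read_chunk_;
  std::vector<char> buffer_;
  std::size_t pos_;  // next unread byte in buffer_
  std::size_t end_;  // one past the last valid byte in buffer_
  bool eof_;
};
```

With this in place, counting lines becomes a loop over `reader.getline(line)`, and the underlying read function is called once per 4 KB chunk instead of once per byte. Unlike the fixed-size gzgets buffer, the std::string grows to hold a line of any length.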

EDIT:

I have just tried zlib-1.2.5. gzgetc and gzgets in 1.2.5 are much faster than those in 1.2.3.