I am pulling data from a bzip2 stream within a C application. As chunks of data come out of the decompressor, they can be written to stdout:

fwrite(buffer, 1, length, stdout);

This works great. I get all the data when it is sent to stdout.

Instead of writing to stdout, I would like to process the decompressed output internally one line at a time, where a line is a string terminated by a newline character (\n).

Do I write the output of the decompressor stream to another buffer, one character at a time, until I hit a newline, and then call the per-line processing function? Is this slow and is there a smarter approach? Thanks for your advice.

EDIT

Thanks for your suggestions. I ended up using a pair of buffers: each time I pass through one output buffer's worth of data, the remainder (the "stub" at the end of the output buffer) is stored and then placed at the beginning of a short line buffer.

I loop through the output buffer character by character and process one newline-terminated line at a time. The remainder that lacks a newline is copied into a heap-allocated buffer and carried over to the line buffer on the next read. Using realloc seems to be less expensive than repeated malloc/free calls.

Here's the code I came up with:

char bzBuf[BZBUFMAXLEN];
BZFILE *bzFp;
int bzError, bzNBuf;
char bzLineBuf[BZLINEBUFMAXLEN];
char *bzBufRemainder = NULL;
int bzBufPosition, bzLineBufPosition;

bzFp = BZ2_bzReadOpen(&bzError, *fp, 0, 0, NULL, 0); /* http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzcompress-init */ 

if (bzError != BZ_OK) {
    BZ2_bzReadClose(&bzError, bzFp);   
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be retrieved\n\n");
    return -1;          
}

bzError = BZ_OK;
bzLineBufPosition = 0;
while (bzError == BZ_OK) {

    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, sizeof(bzBuf));

    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
        if (bzBufRemainder != NULL) {
            /* restore the partial line carried over from the previous chunk */
            bzLineBufPosition = (int)strlen(bzBufRemainder);
            memcpy(bzLineBuf, bzBufRemainder, bzLineBufPosition); /* leave out \0 */
        }

        for (bzBufPosition = 0; bzBufPosition < bzNBuf; bzBufPosition++) {
            if (bzLineBufPosition >= BZLINEBUFMAXLEN - 1) { /* no room for this byte plus \0 */
                fprintf(stderr, "\n\t[gchr2] - Error: Line is longer than BZLINEBUFMAXLEN\n\n");
                free(bzBufRemainder);
                BZ2_bzReadClose(&bzError, bzFp);
                return -1;
            }
            bzLineBuf[bzLineBufPosition++] = bzBuf[bzBufPosition];
            if (bzBuf[bzBufPosition] == '\n') {
                bzLineBuf[bzLineBufPosition] = '\0'; /* terminate bzLineBuf */

                /* process the line buffer, e.g. print it out or transform it, etc. */
                fprintf(stdout, "%s", bzLineBuf);

                bzLineBufPosition = 0; /* reset line buffer position */
                if (bzBufRemainder != NULL)
                    bzBufRemainder[0] = '\0'; /* the carried-over stub has been consumed */
            }
            else if (bzBufPosition == (bzNBuf - 1)) {
                /* chunk ended mid-line: save the stub, including its \0 */
                char *bzTmp = realloc(bzBufRemainder, bzLineBufPosition + 1);
                if (bzTmp == NULL) {
                    fprintf(stderr, "\n\t[gchr2] - Error: Out of memory\n\n");
                    free(bzBufRemainder);
                    BZ2_bzReadClose(&bzError, bzFp);
                    return -1;
                }
                bzBufRemainder = bzTmp;
                bzLineBuf[bzLineBufPosition] = '\0';
                memcpy(bzBufRemainder, bzLineBuf, bzLineBufPosition + 1);
            }
        }
    }
}

if (bzError != BZ_STREAM_END) {
    BZ2_bzReadClose(&bzError, bzFp);
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be uncompressed\n\n");
    return -1;  
} else {   
    BZ2_bzReadGetUnused(&bzError, bzFp, 0, 0);
    BZ2_bzReadClose(&bzError, bzFp);
}

free(bzBufRemainder);
bzBufRemainder = NULL;

I really appreciate everyone's help. This is working nicely.

+2  A: 

I don't think there's a smarter approach (short of finding an automata library that already does this for you). Be careful to size the "last line" buffer properly: if it cannot handle lines of arbitrary length and the input can be supplied by third parties, it becomes a security risk.
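One way to act on the sizing advice above is to let the line buffer grow on demand while refusing lines beyond a fixed limit. The following is only a sketch: the linebuf type, the linebuf_push function and the limit parameter are hypothetical names for illustration, not part of libbz2 or of the code in the question.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable line buffer; none of these names come from libbz2. */
typedef struct {
    char  *data;  /* NUL-terminated contents, or NULL before the first push */
    size_t len;   /* bytes stored, not counting the terminator */
    size_t cap;   /* bytes currently allocated */
} linebuf;

/* Append one byte, doubling the allocation when needed but never
   allocating more than `limit` bytes. Returns 0 on success and -1 if the
   line would exceed the limit or memory runs out; on failure the buffer
   is left unchanged. */
static int linebuf_push(linebuf *lb, char c, size_t limit)
{
    assert(lb != NULL);
    if (lb->len + 1 >= lb->cap) {          /* need room for c and '\0' */
        size_t cap = lb->cap ? lb->cap * 2 : 128;
        char *p;
        if (cap > limit)
            cap = limit;                   /* clamp growth to the limit */
        if (lb->len + 1 >= cap)
            return -1;                     /* line is too long */
        p = realloc(lb->data, cap);
        if (p == NULL)
            return -1;
        lb->data = p;
        lb->cap = cap;
    }
    lb->data[lb->len++] = c;
    lb->data[lb->len] = '\0';
    return 0;
}
```

A caller would start from a zero-initialized linebuf, push each decompressed byte, treat -1 as "reject this input", and set len back to 0 after handling each complete line.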

Pavel Radzivilovsky
A: 

I think you should copy chunks of characters to another buffer until the latest chunk you write contains a newline character. Then you can work on the whole line.

You can save the rest of the buffer (after the '\n') into a temporary buffer and then start the next line from it.
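The chunk-copying idea above could look roughly like the sketch below. It appends each decompressed chunk to a pending buffer, uses memchr (rather than a hand-written character loop) to find newlines, hands every complete line to a callback, and keeps the unfinished tail for the next chunk. The names line_acc, line_acc_feed, line_stats and record_line are made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator for bytes that do not yet form a full line. */
typedef struct {
    char  *data;  /* pending bytes */
    size_t used;  /* number of pending bytes */
    size_t cap;   /* allocated size of data */
} line_acc;

/* Called once per complete line; `line` is NUL-terminated, has its
   newline stripped, and is only valid for the duration of the call. */
typedef void (*line_fn)(const char *line, size_t len, void *ctx);

/* Feed one chunk. Returns 0 on success, -1 if allocation fails. */
static int line_acc_feed(line_acc *acc, const char *chunk, size_t n,
                         line_fn fn, void *ctx)
{
    size_t start = 0;
    char *nl;

    assert(acc != NULL && fn != NULL);
    if (n == 0)
        return 0;
    if (acc->used + n > acc->cap) {        /* grow geometrically */
        size_t cap = acc->cap ? acc->cap : 64;
        char *p;
        while (cap < acc->used + n)
            cap *= 2;
        p = realloc(acc->data, cap);
        if (p == NULL)
            return -1;
        acc->data = p;
        acc->cap = cap;
    }
    memcpy(acc->data + acc->used, chunk, n);
    acc->used += n;

    /* Emit every complete line currently in the buffer. */
    while ((nl = memchr(acc->data + start, '\n', acc->used - start)) != NULL) {
        *nl = '\0';
        fn(acc->data + start, (size_t)(nl - (acc->data + start)), ctx);
        start = (size_t)(nl - acc->data) + 1;
    }

    /* Move the unfinished tail (the "rest of the buffer") to the front. */
    memmove(acc->data, acc->data + start, acc->used - start);
    acc->used -= start;
    return 0;
}

/* Example callback: counts lines and remembers the most recent one. */
typedef struct { size_t count; char last[128]; } line_stats;

static void record_line(const char *line, size_t len, void *ctx)
{
    line_stats *st = ctx;
    if (len >= sizeof st->last)
        len = sizeof st->last - 1;         /* truncate for the demo only */
    memcpy(st->last, line, len);
    st->last[len] = '\0';
    st->count++;
}
```

In the question's read loop, each successful BZ2_bzRead call would be followed by something like line_acc_feed(&acc, bzBuf, bzNBuf, record_line, &stats). Any bytes still pending after the stream ends form a final line without a newline, and the caller decides what to do with them. This sketch does not cap the line length, so it should be combined with a limit for untrusted input.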

Opera
+1  A: 

This would be easy to do using C++'s std::string, but in C it takes some code if you want to do it efficiently (unless you use a dynamic string library).

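char *bz_read_line(BZFILE *input)
{
    size_t offset = 0;
    size_t len = CHUNK;  // arbitrary
    char *output = (char *)xmalloc(len);
    int bzerror = BZ_OK;
    int got_newline = 0;

    while (BZ2_bzRead(&bzerror, input, output + offset, 1) == 1) {
        if (output[offset] == '\n') {
            got_newline = 1;
            break;
        }
        offset++;
        if (offset == len) {  // keep room for the terminating '\0'
            len += CHUNK;
            output = xrealloc(output, len);
        }
        if (bzerror == BZ_STREAM_END)
            break;  // that was the last byte; the final line has no newline
    }

    // Give up on a read error, or when there is nothing left to return.
    if (!got_newline && (offset == 0 || bzerror != BZ_STREAM_END)) {
        free(output);
        return NULL;
    }

    output[offset] = '\0';  // strip trailing newline, or terminate the final line
    return output;
}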

(Where xmalloc and xrealloc handle errors internally. Don't forget to free the returned string.)

This is almost an order of magnitude slower than bzcat:

lars@zygmunt:/tmp$ wc foo
 1193  5841 42868 foo
lars@zygmunt:/tmp$ bzip2 foo
lars@zygmunt:/tmp$ time bzcat foo.bz2 > /dev/null

real    0m0.010s
user    0m0.008s
sys     0m0.000s
lars@zygmunt:/tmp$ time ./a.out < foo.bz2 > /dev/null

real    0m0.093s
user    0m0.044s
sys     0m0.020s

Decide for yourself whether that's acceptable.
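If the gap matters, much of it plausibly comes from calling BZ2_bzRead once per byte. A possible mitigation, sketched here under that assumption, is to put a small buffered reader in front of the decompressor and hand out bytes from memory. The byte_reader type, the refill_fn callback and the string_src demo source are hypothetical names; the refill callback is kept generic so the sketch compiles without libbz2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refill callback: writes up to `cap` bytes into `dst` and
   returns how many were produced, or 0 at end of data or on error. */
typedef size_t (*refill_fn)(void *src, char *dst, size_t cap);

typedef struct {
    refill_fn refill;
    void     *src;
    char      buf[4096];
    size_t    pos;   /* next unread byte in buf */
    size_t    len;   /* number of valid bytes in buf */
} byte_reader;

/* Returns the next byte as an unsigned char value, or -1 once the
   source is exhausted. The source is only called when buf is empty. */
static int byte_reader_getc(byte_reader *r)
{
    assert(r != NULL && r->refill != NULL);
    if (r->pos == r->len) {
        r->len = r->refill(r->src, r->buf, sizeof r->buf);
        r->pos = 0;
        if (r->len == 0)
            return -1;
    }
    return (unsigned char)r->buf[r->pos++];
}

/* Demo source: serves a string at most 3 bytes at a time, which forces
   several refills even for short input. */
typedef struct { const char *s; } string_src;

static size_t string_refill(void *src, char *dst, size_t cap)
{
    string_src *ss = src;
    size_t n = 0;
    while (n < cap && n < 3 && ss->s[n] != '\0') {
        dst[n] = ss->s[n];
        n++;
    }
    ss->s += n;
    return n;
}
```

With libbz2, the refill callback would presumably wrap a call such as BZ2_bzRead(&bzerror, bzf, dst, (int)cap), returning the byte count when bzerror is BZ_OK or BZ_STREAM_END and 0 otherwise. bz_read_line could then loop over byte_reader_getc instead of reading one byte per library call. Note that a reader like this may pull in bytes beyond the current line, so all reads on that stream have to go through it.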

larsmans
I have a bunch of bz2 streams concatenated in one very large file. I'm trying to write a self-contained application to unpack one stream among many. This is very helpful, thanks!
Alex Reynolds