Hello,

I need to parse very large log files (>1 GB, <5 GB) - actually I need to strip the data into objects so I can store them in a DB. The log file is sequential (no line breaks), like:

TIMESTAMP=20090101000000;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000100;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000152;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;...

I need to strip this into the table:

TIMESTAMP | PARAM1 | PARAM2 | PARAM3

The process needs to be as fast as possible. I'm considering using Perl, but any suggestions using C/C++ would be really welcome. Any ideas?

Best regards,

Arthur

+3  A: 

Lex handles this sort of thing amazingly well.

Nikolai N Fetissov
I like it when someone uses a tool *well* for something it wasn't designed to do.
BCS
+7  A: 

Write a prototype in Perl and compare its performance against how fast you can read data off of the storage medium. My guess is that you'll be I/O bound, which means that using C won't offer a performance boost.
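
For instance, a quick way to get that raw-read baseline (a rough sketch; the 1 MB buffer size is an arbitrary choice):

use strict;
use warnings;
use Time::HiRes qw(time);

# Read the file sequentially in large chunks and report raw throughput.
my $file = shift @ARGV or die "usage: $0 <logfile>\n";
open my $fh, '<:raw', $file or die "cannot open $file: $!";

my ($buf, $bytes, $t0) = ('', 0, time());
while (my $n = sysread($fh, $buf, 1024 * 1024)) {
    $bytes += $n;
}
my $secs = time() - $t0;
printf "%.1f MB in %.2f s = %.1f MB/s\n",
    $bytes / 1e6, $secs, $bytes / 1e6 / $secs;

If the real parser ends up running at roughly this speed, switching languages won't buy much.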

Dave
No need for esoteric solutions. 5 GB is not THAT big. SATA2 is 300 MB/sec, so it should take around 20 secs.
ebo
Sure, if you only want to do this once.
Dave
@ebo: The 300 MB/s number is a theoretical maximum. Typical hard drives actually get 75-150 MB/s if there isn't any contention. If you properly set up a RAID array, you can actually bump up against the interface bandwidth limit.
Mr Fooz
A: 

You might want to take a look at Hadoop (Java) or Hadoop Streaming (which runs Map/Reduce jobs with any executable or script).

+2  A: 

But really, use AWK. Its performance is not bad, even compared with Perl, etc. Of course Map/Reduce would work quite well, but what about the overhead of splitting the file into appropriate chunks?

Try AWK
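
Since the file has no newlines, the awk approach hinges on setting the record separator RS (see the comment below). For comparison, here is the same idea as a rough Perl sketch using $/ instead; the field and column names come straight from the question's sample, and it's untested against real data:

use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 <logfile>\n";
open my $fh, '<', $file or die "cannot open $file: $!";

local $/ = ';';                  # read one KEY=VALUE field at a time

my %row;
while (my $field = <$fh>) {
    chomp $field;                # strip the trailing ';'
    my ($key, $value) = split /=/, $field, 2;
    if ($key eq 'TIMESTAMP' && %row) {
        # a new TIMESTAMP starts the next record, so emit the previous one
        print join('|', @row{qw(TIMESTAMP PARAM1 PARAM2 PARAM3)}), "\n";
        %row = ();
    }
    $row{$key} = $value;
}
print join('|', @row{qw(TIMESTAMP PARAM1 PARAM2 PARAM3)}), "\n" if %row;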

Marcin Cylke
This would require using awk's RS variable (=~ Perl's $/) since the file does not contain newlines. What effect does that have on performance, given that "man awk" here says "RS is a regular expression [when not a single character]"?
Martin Carpenter
A: 

If you code your own solution, you will probably benefit from reading larger chunks of data from the file, processing them in batches (rather than using, say, readline()), and looking for the newline that marks the end of each row. With this approach, you need to be mindful that you may not have retrieved the entirety of the last line, so some logic is required to carry it over to the next chunk.

I don't know what performance benefits you'd realize, since I haven't tested it, but I've leveraged similar techniques with success.
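
A rough sketch of that chunk-plus-carryover idea in Perl (the asker's language of choice): the 4 MB block size is arbitrary, process_fields() is just a hypothetical placeholder for the actual parsing, and the boundary searched for is ';' rather than a newline, since this file has none.

use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 <logfile>\n";
open my $fh, '<:raw', $file or die "cannot open $file: $!";

sub process_fields {
    my ($chunk) = @_;
    # Placeholder: a real version would split the KEY=VALUE pairs out of
    # $chunk here and hand them to the DB loader.
    print length($chunk), " bytes of complete fields\n";
}

my $carry = '';
while (read($fh, my $block, 4 * 1024 * 1024)) {
    my $data = $carry . $block;
    my $cut  = rindex($data, ';');               # last complete field boundary
    if ($cut < 0) { $carry = $data; next; }      # no boundary yet, keep buffering
    process_fields(substr($data, 0, $cut + 1));  # everything up to that boundary
    $carry = substr($data, $cut + 1);            # partial tail waits for more data
}
process_fields($carry) if length $carry;         # whatever is left at end of file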

Ryan Emerle
no newlines...
BCS
+2  A: 

The key won't be the language because the problem is I/O bound, so pick the language that you feel most comfortable with.

The key is how it is coded. You'll be fine as long as you don't load the whole file into memory -- load chunks at a time and save the data a chunk at a time; that will be more efficient.

Java has a PushbackInputStream that may make this easier to code. The idea is that you guess how much to read, and if you read too little, then push the data back, and read a larger chunk.

Then when you've read too much, process the data and then push back the remaining bit and continue to the next iteration of the loop.

Clay Lenhart
Why PushbackInputStream? Wrap any InputStream in an InputStreamReader (specifying the correct encoding, of course) and a BufferedReader. Then call readLine().
Joachim Sauer
Sorry, I missed the "no line-breaks" part.
Joachim Sauer
+1  A: 

Something like this should work.

use strict;
use warnings;

my $filename = shift @ARGV;

open my $io, '<', $filename or die "Can't open $filename: $!";

my ($match_buf, $read_buf, $count) = ('', '', 0);

# sysread returns the number of bytes read, and 0 at end of file.
while (($count = sysread($io, $read_buf, 1024)) != 0) {
    # Append the new chunk; anything left unmatched from the previous chunk
    # is still in $match_buf, so records split across chunk boundaries are
    # handled correctly.
    $match_buf .= $read_buf;
    while ($match_buf =~ s{TIMESTAMP=(\d{14});PARAM1=([^;]+);PARAM2=([^;]+);PARAM3=([^;]+);}{}) {
        my ($timestamp, @params) = ($1, $2, $3, $4);
        # @params now holds PARAM1..PARAM3, ready to be stored in the DB
        print $timestamp . "\n";
    }
}
Peter Stuifzand
I'm not a Perl programmer, but it looks like you were reading the file in 1024-byte chunks. Wouldn't this miss timestamps at the end of the chunks, e.g. at position 1020? The first chunk would only contain "TIME" and the second would start with "STAMP...", so the regex wouldn't match.
Niki
Thank you, you're right. I thought about this while writing the program. While I was testing this piece of code, it didn't show that problem. I made the edit to fix this bug. It now saves the bit that was left over.
Peter Stuifzand
+1  A: 

This is easily handled in Perl, Awk, or C. Here's a start on a version in C for you:

#include <stdio.h>
#include <string.h>
#include <err.h>

int
main(int argc, char **argv)
{
        const char      *filename = "noeol.txt";
        FILE            *f;
        char            buffer[1024], *s, *p;
        char            line[1024];
        size_t          n;

        if ((f = fopen(filename, "r")) == NULL)
                err(1, "cannot open %s", filename);
        while (!feof(f)) {
                n = fread(buffer, 1, sizeof buffer, f);
                if (n == 0) {
                        if (ferror(f))
                                err(1, "error reading %s", filename);
                        else
                                continue;
                }
                /*
                 * Scan the chunk for ';'-terminated fields.  Note that a
                 * field straddling two fread() chunks is dropped here; a
                 * complete version would carry the tail over to the next
                 * chunk.
                 */
                for (s = p = buffer; (size_t)(p - buffer) < n; p++) {
                        if (*p == ';') {
                                *p = '\0';
                                strncpy(line, s, p - s + 1);
                                s = p + 1;
                                if (strncmp("TIMESTAMP", line, 9) != 0)
                                        printf("\t");
                                printf("%s\n", line);
                        }
                }
        }
        fclose(f);
        return 0;
}
dwc
A: 

I know this is an exotic language and may not be the best solution for this, but when I have ad hoc data, I consider PADS.

LB
+3  A: 

This presentation about the use of Python generators blew my mind: http://www.dabeaz.com/generators-uk/

David M. Beazley shows how to process multi-gigabyte log files by basically defining a generator for each processing step. The generators are then 'plugged' into each other until you have some simple utility functions

lines = lines_from_dir("access-log*","www")
log   = apache_log(lines)
for r in log:
    print r

which can then be used for all sorts of querying:

stat404 = set(r['request'] for r in log
                if r['status'] == 404)

large = (r for r in log
           if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']

He also shows that the performance compares well with that of standard Unix tools like grep, find, etc. Of course, this being Python, it's much easier to understand and, most importantly, easier to customise or adapt to different problem sets than Perl or awk scripts.

(The code examples above are copied from the presentation slides.)

Rob
Whee, functional programming!
Thorbjørn Ravn Andersen
+1  A: 

Sounds like a job for sed:

sed -e 's/;\?[A-Z0-9]*=/|/g' -e 's/\(^|\)\|\(;$\)//g' < input > output
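
A roughly equivalent Perl one-liner, for comparison (a sketch only; note that with no newlines in the input, both this and the sed command end up holding the whole file in memory as a single "line", and the first output line comes out empty):

perl -pe 's/;?TIMESTAMP=/\n/g; s/;PARAM\d+=/|/g; s/;$//' < input > output

This turns each record into one '|'-separated line, closer to the TIMESTAMP | PARAM1 | PARAM2 | PARAM3 table the question asks for.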
Joachim Sauer