I have a few very large log files, and I need to parse them. Ease of implementation obviously points me to a Perl and regex combo (in which I am still a novice). But what about speed? Would it be faster to implement it in C? Each log file is on the order of 2 GB.

+4  A: 

In the past, I have found C to be faster, but not to the extent that the choice was a foregone conclusion.

Have you thought about using a generic Log Parser tool, such as Log Parser:

Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®.

This site lists a few generic log parsers.

Mitch Wheat
+17  A: 

The Perl regex matcher is heavily optimized. This is where Perl shines: you should have no trouble working with a 2 GB file in Perl, and the performance should be easily comparable to the C version. By the way, did you try looking for an existing log parser? There are plenty of them.

zoul
Log file parsers are news to me. Thank you, I will search.
Alphaneo
+3  A: 

If you are proficient in Perl, use it. Otherwise, use AWK and SED.

Parsing text is not what you want to do with C.

voyager
Unless you have PCRE :-)
paxdiablo
+1 for the AWK and SED reference. My hierarchy of languages (go from left to right until you can handle the problem well) is grep -> sed -> awk -> full compiled language.
T.E.D.
+30  A: 

I very much doubt C will be faster than Perl unless you were to hand-compile the RE.

By hand-compiling, I mean coding the finite state machine (FSM) directly rather than using the RE engine to compile it. This approach means you can optimize it for your specific case which can often be faster than relying on the more general-purpose engine.

But that's not something I'd ever suggest to anyone who hasn't had to write compilers or parsers before without the benefit of lex, yacc, bison or other similar tools.

The generalized engines, such as PCRE, are usually powerful and fast enough (for my purposes anyway, and they've been demanding in the past).

When using a general RE engine, it needs to be able to handle all sorts of cases whether it's written in C or Perl. When you think about which is faster, you only have to compare what the RE engines are written in for both cases (hint: the Perl RE engine is not written in Perl).

They're both written in C so you should find very little difference in terms of the matching speed.

You may find differences in the support code around the REs but that will be minimal, especially if it's a simple read/match/output loop.
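
As a rough illustration of the "simple read/match/output loop" mentioned above, here is a minimal Perl sketch. The log line format, the field names, and the `scan_log` helper are all assumptions made up for the example; they do not come from the question. The demo reads from an in-memory filehandle so that it runs without a real log file.

```perl
use strict;
use warnings;

# Hypothetical log line format (an assumption, not from the question):
#   2009-05-20 12:00:01 ERROR disk full
my $line_re = qr/^(\S+)\s+(\S+)\s+(ERROR|WARN)\s+(.*)$/;

# Read one line at a time, match it, and collect the hits.
sub scan_log {
    my ($fh) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ $line_re ) {
            push @hits, { date => $1, time => $2, level => $3, msg => $4 };
        }
    }
    return \@hits;
}

# Demo on an in-memory filehandle; a real run would open the log file instead.
my $sample = join '',
    "2009-05-20 12:00:00 INFO started\n",
    "2009-05-20 12:00:01 ERROR disk full\n",
    "2009-05-20 12:00:02 WARN retrying\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $hits = scan_log($fh);
close $fh;
print "$_->{level}: $_->{msg}\n" for @$hits;
```

Almost all of the per-line work here happens inside the regex engine and the I/O layer, which is the point being made: the Perl code around the match is thin.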

paxdiablo
Implementing an actual state machine ("hand-compiling the RE") is precisely what you would do in C, so it would almost certainly be faster. You also have a lot more control over file buffering behavior in C, which is going to be the main determinant of speed no matter what language is used.
T.E.D.
Faster to run but slower to write :-). I'd tend to use PCRE or something similar as a first attempt. If performance becomes an issue, then I'd consider creating my own FSM.
paxdiablo
Is writing a custom FSM likely to help in a situation like this? I mean, is CPU time generally going to be the bottleneck in this situation, or file I/O? Assuming, just for simplicity's sake, a single processor and a typical consumer hard drive.
intuited
+4  A: 

Perl obviously has some overhead compared to C. But this overhead may be negligible if you spend most of the time inside the Perl Regex functions implemented in C.

Unknown
+12  A: 

If you actually need to use regexes, then the Perl regex engine is hard to beat. However, many parsing problems can be solved more efficiently without them - for example, if you just need to split a line at a certain character, in which case C will probably be faster.

If performance is of overriding importance, then you should try both languages, and measure the speed difference. Otherwise, simply use the one you are most comfortable with.
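
One way to do the measuring within Perl itself is the core Benchmark module. The sketch below compares a plain `split` against a capturing regex on a single line; the pipe-delimited line and the two helper subs are hypothetical, and the numbers will depend on your perl build and your real data, so treat it only as a template for your own test.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical pipe-delimited log line (an assumption for illustration).
my $line = '2009-05-20 12:00:01|ERROR|disk|/dev/sda1 is full';

# Split at a fixed character, with no capture groups.
sub fields_by_split {
    my ($l) = @_;
    return split /\|/, $l;
}

# Extract the same four fields with a capturing regex.
sub fields_by_regex {
    my ($l) = @_;
    return $l =~ /^([^|]*)\|([^|]*)\|([^|]*)\|(.*)$/;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    split => sub { my @f = fields_by_split($line) },
    regex => sub { my @f = fields_by_regex($line) },
} );
```

The same idea applies across languages: time the Perl script and a C prototype on the same real log file before committing to either.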

anon
+7  A: 

I'm guessing (in lieu of benchmarking against Alphaneo's actual data, which I don't have) that I/O processing is going to be the bounding factor here. And I'd expect a Perl implementation on a perl with usefaststdio enabled to match or beat a basic C implementation, but to be noticeably slower without usefaststdio. (usefaststdio was on by default in perl 5.8 and earlier for most platforms and off by default in perl 5.10.)

ysth
Thanks for the insight on the IO bottleneck. It really matters.
Alphaneo
+7  A: 

Is speed really a factor here? Do you actually care whether parsing is done after 5 or 10 minutes?

Go for the language or tool that offers the best parsing features and that you are most familiar with.

innaM
On completion of a test, a log is generated, and this log file is parsed for any issues. If there is an issue, we immediately start working on it. It would be really helpful even if it saved just a few seconds.
Alphaneo
+3  A: 

Part of this depends on how the parsing will be integrated into an application. If the application IS the parser, then Perl will be fine, since it can handle everything surrounding the parsing too. But if the parser is integrated DIRECTLY into a larger application, you may want to look into something like Lex (or Flex these days): http://en.wikipedia.org/wiki/Lex_(software) This tool generates the scanner for you, and you can integrate the C/C++ code directly into your software.

As for speed considerations, I agree with most other responders here that the maturity of the library used will be the dominant factor, and Perl's is VERY mature. I don't know how mature some of the other libraries are (like the regex one available for C++ from Boost), but since most of your processing time will be spent in the library, language concerns are likely secondary.

Bottom line: use what you're most comfortable with, and do as much work as possible inside the library, as it's almost-always faster than what you can produce yourself, in any language.

Kevin
+13  A: 
  • A naively written Perl regex-based parser will be faster than a naively written C regex-based parser.
  • A well-written Perl regex-based parser will be vastly faster than a naively written C regex-based parser.
  • A well-written C regex-based parser will be marginally faster than a well-written Perl regex-based parser. (It will also be twice as hard to write and ten times harder to debug.)
chaos
Who said he was using a regex-based parser in C? If speed is a concern (thus driving someone to C), why on earth would someone use slow regex parsing?
T.E.D.
(1) A regex is by far the fastest way of parsing things... especially logs. (2) A well-written C regex-based parser will be 100x harder to write/debug than a Perl parser... because Perl regexen "Just Work (TM)".
Massa
Well, it isn't faster than a FSM designed to parse your particular case. It might be the fastest mechanism with any generality, so it's faster to *write* a regex than to write an equivalent parser. In debugging difficulty, I assumed a C regex library would be used, not a custom regex engine built.
chaos
+3  A: 

Yes, you can make a much faster parser in C if you know what you are doing.

However, for the vast majority of people a smarter thing to worry about would be ease of implementation and maintenance of the code. A fast parser that you can't get to work right does nobody any good.

T.E.D.
+1  A: 

If you are parsing logs in Apache common log format, visitors, which is written in C, will beat any comparable Perl log parser by at least a factor of 2.

So find existing parsers and benchmark them if the log format is common.

A properly written log parser in C will always be significantly faster than a properly written log parser in Perl, based on my past experiences.

obecalp
+12  A: 

If you are equally skilled in C and Perl, the answer is simple:

  1. Write it in Perl.
  2. If it is too slow, profile it and fix it.
  3. If it is still too slow, and the problem is excessive CPU or RAM usage, consider writing it in C.

Generally, I'd say this applies unless you are some sort of C godlet that can deftly manipulate the foundations of reality through puissant manipulation of pointers and typecasts.

Seriously, the regex implementation in Perl is very fast, flexible and well tested. Any code you write may be fast and flexible, but it can never be as thoroughly tested.

Since you are new to Perl and regex, it is important to remember that there are resources that can provide you with excellent help if you need it. There are even some nice tutorials in the fine manual.

Whatever you do, don't do this:

for my $line ( <$log> ) {
    # parse line here.
}

You will read the whole log file into memory and it will take forever as your system swaps and swaps (and possibly crashes).

Instead use a while loop:

while ( my $line = <$log> ) {
    # parse line here.
}
daotoad
I'm no Perl expert. Why exactly does the for snippet read the whole file? Does it have anything to do with the fact that for has to know in advance how many times to spin, but while checks every time?
Ignas Limanauskas
@Ignas, I missed your comment, so you may never see this. In case you find your way back here: `for` evaluates the contents of the parens in list context. `<>` in list context grabs the whole file. `while` evaluates the code in parens in scalar context. In scalar context, `<>` reads one line at a time.
daotoad
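
The context difference described in the comment above can be seen on a tiny input. This is only a sketch: it uses an in-memory filehandle with three made-up lines in place of a real log, so nothing here depends on an actual file.

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\nline 3\n";

# List context: <$fh> returns every remaining line at once.
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @all = <$fh>;
close $fh;
print "list context read ", scalar(@all), " lines in one go\n";

# Scalar context: <$fh> returns one line per call.
open $fh, '<', \$data or die "cannot open in-memory file: $!";
my $first  = <$fh>;
my $second = <$fh>;
close $fh;
print "scalar context read: $first";
```

With a 2 GB log, the list-context form would hold every line in memory before the loop body runs even once, which is why the `while` form is the one to use.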
A: 

If you are going to be applying the same regular expression to every line, don't forget that you can greatly optimize the execution by appending the /o flag to the pattern, i.e.

if(/[a-zA-Z]+/o)

This will cause the expression to be compiled internally only once and for that result to be subsequently re-used, instead of on every successive loop iteration.

Armed with that enhancement, I would be very surprised if your Perl parser didn't walk all over whatever C implementation you'd feasibly be able to come up with in a realistic amount of time.

Alex Balashov
This isn't true (any more). For a long time now, the /o option to regular expressions has been mostly superfluous. It only ever has an effect if you interpolate variables into the regular expression. Otherwise, it'll only be compiled once anyway. See "perldoc perlre".
tsee
I agree. It sounded like the author wanted to interpolate variables.
Alex Balashov
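
For the case raised in the comments, where a variable is interpolated into the pattern, one commonly used alternative to /o is to precompile the pattern with `qr//` outside the loop. The sketch below assumes a hypothetical `$keyword` supplied at run time and a few made-up lines; it is an illustration, not a claim about what the original author had in mind.

```perl
use strict;
use warnings;

# Hypothetical keyword supplied at run time (an assumption for illustration).
my $keyword = 'ERROR';

# qr// compiles the interpolated pattern once, outside the loop.
# \Q...\E quotes any regex metacharacters in the keyword.
my $re = qr/\b\Q$keyword\E\b/;

my @lines = (
    "12:00:00 INFO started",
    "12:00:01 ERROR disk full",
    "12:00:02 ERRORS are not matched because of the word boundary",
);

# Reuse the compiled pattern for every line.
my @matched = grep { $_ =~ $re } @lines;
print "$_\n" for @matched;
```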