I have a few very large log files, and I need to parse them. Ease of implementation obviously points me to a Perl and regex combo (in which I am still a novice). But what about speed? Will it be faster to implement it in C? Each log file is on the order of 2 GB.
In the past, I have found C to be faster, but not to the extent that the choice was a foregone conclusion.
Have you thought about using a generic log parsing tool, such as Microsoft's Log Parser? Its description reads:
Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®.
This site lists a few generic log parsers.
The Perl regex matcher is heavily optimized. This is where Perl shines: you should have no trouble working with a 2 GB file in Perl, and the performance should be easily comparable to a C version. By the way, have you looked for an existing log parser? There are plenty of them.
I very much doubt C will be faster than Perl unless you were to hand-compile the RE.
By hand-compiling, I mean coding the finite state machine (FSM) directly rather than using the RE engine to compile it. This approach means you can optimize it for your specific case which can often be faster than relying on the more general-purpose engine.
But that's not something I'd ever suggest to anyone who hasn't had to write compilers or parsers before without the benefit of lex, yacc, bison or other similar tools.
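To make the hand-compiling idea concrete, here is a minimal sketch in C of a hand-coded state machine. It recognizes a timestamp of the form DD:DD:DD (D being a digit), roughly what the regex \d\d:\d\d:\d\d would match. The timestamp format and the function names match_time_prefix and find_time are assumptions made up for this illustration; they do not come from the question.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical example: return 1 if s begins with "DD:DD:DD" (D = digit).
   The state is simply the number of characters accepted so far;
   reaching state 8 means the whole timestamp was seen. */
int match_time_prefix(const char *s)
{
    int state = 0;
    while (state < 8) {
        unsigned char c = (unsigned char)s[state];
        int want_colon = (state == 2 || state == 5);
        /* A '\0' is neither a digit nor ':', so short input is rejected
           before we can read past the end of the string. */
        if (want_colon ? (c != ':') : !isdigit(c))
            return 0;   /* no transition for this input: reject */
        state++;
    }
    return 1;
}

/* Return a pointer to the first timestamp in line, or NULL if none.
   This simply retries the anchored matcher at every offset. */
const char *find_time(const char *line)
{
    for (const char *p = line; *p != '\0'; p++)
        if (match_time_prefix(p))
            return p;
    return NULL;
}
```

Even for a pattern this small, the hand-written version only handles one fixed format, which is the trade-off described above: it can be tuned to the specific case, but every change to the log format means rewriting the matcher.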
The generalized engines, such as PCRE, are usually powerful and fast enough (for my purposes anyway, and they've been demanding in the past).
A general RE engine needs to be able to handle all sorts of cases, whether you call it from C or from Perl. When you think about which is faster, you only have to compare what the RE engines are written in for both cases (hint: the Perl RE engine is not written in Perl).
They're both written in C so you should find very little difference in terms of the matching speed.
You may find differences in the support code around the REs but that will be minimal, especially if it's a simple read/match/output loop.
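For comparison, a simple read/match/output loop on the C side might look like the sketch below. It uses the POSIX regex API (regcomp/regexec from <regex.h>), which is neither Perl's engine nor PCRE and supports a smaller pattern syntax, so treat it as an illustration of how thin the support code is rather than a like-for-like benchmark. The function name grep_lines and the 4096-byte line buffer are assumptions for this example.

```c
#include <regex.h>
#include <stdio.h>

/* Copy every line of `in` that matches the POSIX extended regex `pattern`
   to `out`. Returns the number of matching lines, or -1 if the pattern
   does not compile. */
long grep_lines(FILE *in, FILE *out, const char *pattern)
{
    regex_t re;
    /* Compile once, before the loop. REG_NOSUB: we only need yes/no. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* Assumed maximum line length for this sketch; a longer line is read
       in several chunks and each chunk is matched separately. */
    char line[4096];
    long hits = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        /* The line still carries its trailing newline here. */
        if (regexec(&re, line, 0, NULL, 0) == 0) {
            fputs(line, out);
            hits++;
        }
    }
    regfree(&re);
    return hits;
}
```

Called as grep_lines(stdin, stdout, "^GET "), this reads one line at a time, so memory use stays constant regardless of the size of the log.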
Perl obviously has some overhead compared to C. But this overhead may be negligible if you spend most of the time inside the Perl Regex functions implemented in C.
If you actually need to use regexes, then the Perl regex engine is hard to beat. However, many parsing problems can be solved more efficiently without them - for example, if you just need to split a line at a certain character, in which case C will probably be faster.
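As an illustration of the no-regex case, here is a minimal C sketch that splits a line in place on a single delimiter character using strchr. The function name split_fields and the pipe-delimited sample format are hypothetical; the sketch assumes the delimiter is not the NUL character and that the caller owns a writable copy of the line.

```c
#include <string.h>

/* Split `line` in place on `delim`, storing up to `max_fields` pointers
   in `fields`. Returns the number of fields stored. If the line has more
   fields than `max_fields`, the last slot keeps the unsplit remainder.
   Assumes delim != '\0'. Modifies `line`. */
size_t split_fields(char *line, char delim, char **fields, size_t max_fields)
{
    size_t n = 0;
    char *start = line;
    while (n < max_fields) {
        fields[n++] = start;
        if (n == max_fields)
            break;              /* last slot keeps the rest of the line */
        char *sep = strchr(start, delim);
        if (sep == NULL)
            break;              /* no more delimiters */
        *sep = '\0';            /* terminate the current field */
        start = sep + 1;        /* next field begins after the delimiter */
    }
    return n;
}
```

No pattern is compiled and no memory is allocated, which is why this kind of parsing tends to favor C when the log format allows it.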
If performance is of overriding importance, then you should try both languages, and measure the speed difference. Otherwise, simply use the one you are most comfortable with.
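For the C side of such a measurement, a rough timing harness could look like the sketch below. The names time_cpu_seconds and sample_parse are hypothetical, and sample_parse is only a stand-in for "parse the log once". Note that clock() reports CPU time, not wall-clock time, so it will not show time spent waiting on disk; for an end-to-end comparison of two whole programs, the shell's time command is the simpler tool.

```c
#include <time.h>

/* Run fn(arg) `runs` times and return the average CPU seconds per run.
   Returns 0.0 if runs is not positive. */
double time_cpu_seconds(void (*fn)(void *), void *arg, int runs)
{
    if (runs <= 0)
        return 0.0;
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}

/* Hypothetical workload standing in for one pass of the parser;
   here it only counts how many times it was called. */
void sample_parse(void *arg)
{
    long *calls = arg;
    (*calls)++;
}
```

Whatever harness is used, both versions should be run against the same real log file, several times, so that file-system caching affects them equally.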
I'm guessing (in lieu of benchmarking against Alphaneo's actual data, which I don't have) that I/O processing is going to be the limiting factor here. And I'd expect a Perl implementation on a perl with usefaststdio enabled to match or beat a basic C implementation, but to be noticeably slower without usefaststdio. (usefaststdio was on by default in perl 5.8 and earlier for most platforms and off by default in perl 5.10.)
Is speed really a factor here? Do you actually care whether parsing is done after 5 or 10 minutes?
Go for the language or tool that offers the best parsing features and that you are most familiar with.
Part of this depends on how the parsing will be integrated into an application. If the application IS the parser, then Perl will be fine, simply because it will handle everything surrounding the parsing too. But if the parser is integrated DIRECTLY into a larger application, then you may well want to look into something like Lex (or Flex these days): http://en.wikipedia.org/wiki/Lex_(software) This tool generates the scanner code for you, and you can integrate the resulting C/C++ code directly into your software.
As for speed considerations, I agree with most other responders here that the maturity of the library used will be the dominant factor, and Perl's is VERY mature. I don't know how mature some of the other libraries are (like the regex one available for C++ from Boost), but since most of your processing time will be spent in the library, language concerns are likely secondary.
Bottom line: use what you're most comfortable with, and do as much work as possible inside the library, as it's almost always faster than what you can produce yourself, in any language.
- A naively written Perl regex-based parser will be faster than a naively written C regex-based parser.
- A well-written Perl regex-based parser will be vastly faster than a naively written C regex-based parser.
- A well-written C regex-based parser will be marginally faster than a well-written Perl regex-based parser. (It will also be twice as hard to write and ten times harder to debug.)
Yes, you can make a much faster parser in C if you know what you are doing.
However, for the vast majority of people a smarter thing to worry about would be ease of implementation and maintenance of the code. A fast parser that you can't get to work right does nobody any good.
If you are parsing logs in Apache common log format, visitors, which is written in C, will beat any comparable Perl log parser by at least a factor of 2.
So find existing parsers and benchmark them if the log format is common.
Based on my past experience, a properly written log parser in C will always be significantly faster than a properly written log parser in Perl.
If you are equally skilled in C and Perl, the answer is simple:
- Write it in Perl.
- If it is too slow, profile it and fix it.
- If it is still too slow, and the problem is excessive CPU or RAM usage, consider writing it in C.
Generally, I'd say this applies unless you are some sort of C godlet that can deftly manipulate the foundations of reality through puissant manipulation of pointers and typecasts.
Seriously, the regex implementation in Perl is very fast, flexible and well tested. Any code you write may be fast and flexible, but it can never be as thoroughly tested.
Since you are new to Perl and regex, it is important to remember that there are resources that can provide you with excellent help if you need it. There are even some nice tutorials in the fine manual.
Whatever you do, don't do this:
for my $line ( <$log> ) {
# parse line here.
}
Because a for loop evaluates <$log> in list context, this reads the whole log file into memory before the first iteration, and it will take forever as your system swaps and swaps (and possibly crashes).
Instead use a while loop:
while ( my $line = <$log> ) {
# parse line here.
}
If you are going to be applying the same regular expression to every line, make sure it is compiled only once. A pattern with no interpolated variables, such as the one below, is already compiled once when the script itself is compiled, so the /o flag adds nothing to it:
if (/[a-zA-Z]+/)
The /o flag only matters when the pattern interpolates a variable, and even then Perl 5.6 and later recompile only when the variable's value changes. For patterns built at run time, precompiling with qr// outside the loop is the usual way to get a single compiled pattern that is re-used on every iteration.
With that taken care of, I would be very surprised if your Perl parser didn't walk all over whatever C implementation you'd feasibly be able to come up with in a realistic amount of time.