views: 359
answers: 12

My company's proprietary software generates a log file that is much easier to use if it is parsed. The log parser we all use was written by another employee as a side project, and it has horrible performance.

These log files can quickly grow to tens of megabytes, and the parser we currently use has issues with any file bigger than 1 megabyte.

So, I want to write a program that can parse this massive amount of text in the shortest amount of time possible. We use Windows exclusively, so running on Windows is a must. Our current implementation runs on a local web server, and I'm convinced that running it as an application would have to be faster.

All suggestions will be helpful. Thanks.

EDIT: My ultimate goal is to parse the text and display it in a much more user-friendly manner, with colors and such. Can you do this with Perl and Python? I know you can do it with Java and C++. So it will function like Notepad: you open a log file, but the screen shows the user-friendly format instead of the raw file.

EDIT: So, I can't pick a single best answer; the best advice was to choose a language that can best display what I'm going for, and then write the parser in that. Also, using ANTLR will probably make this process much easier. I changed the original question, since I guess I didn't ask what I was really looking for. Thanks, everyone!

+6  A: 

I would suggest using Python or Perl. Parsing large text files with regular expressions is really fast.

compie
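To make the regex suggestion concrete, here is a minimal Python sketch. The line layout, field names, and file name are assumptions for illustration only; the real proprietary format will differ. Reading line by line keeps memory use flat even on files of tens of megabytes.

import re

# Assumed line shape: "2010-04-12 13:45:01 ERROR something went wrong"
LINE_RE = re.compile(r'^(?P<date>\S+ \S+)\s+(?P<level>\w+)\s+(?P<message>.*)$')

def parse_log(path):
    """Yield one dict per line that matches; silently skip the rest."""
    with open(path) as f:
        for line in f:
            m = LINE_RE.match(line)
            if m:
                yield m.groupdict()

if __name__ == '__main__':
    errors = [rec for rec in parse_log('app.log') if rec['level'] == 'ERROR']
    print(len(errors), 'error lines')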
+2  A: 

I believe Perl is considered a good choice for parsing text.

Gratzy
A: 

Perl is good for text processing.

A number of very good text processing programs have been written in Perl. Ack (a grep replacement) is one.

David Johnstone
downvote for what?
David Johnstone
A: 

Sounds like a job for Perl, much as I don't particularly care for it as a language myself. ActivePerl is a reasonable distribution of Perl for Windows.

Donal Fellows
+3  A: 

I've used both Python and Perl. Perl is a more natural fit for this but can be hard to maintain. Python will do it just as well and is easier to read. Go for Python.

j0rd4n
But all the $@% are so beautiful! Go for perl!
Jefromi
@Jefromi - Ha! There's nothing like coming back to 200 lines of symbol soup months later trying to figure out what the heck you were thinking. =)
j0rd4n
I added some information to the post to clarify how I'm going to be using the parsed text. I want to have a GUI that will display the log, but in a friendly format. I don't think I've ever seen a Windows GUI app written using Perl or Python, but I know very little about them.
HenryAdamsJr
@j0rd4n: That just means you get to have all the fun twice!
Jefromi
I've never used it, but the Tk GUI library comes installed with Python on all platforms, including Windows.
ptomato
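Since the stated goal is a Notepad-like viewer that shows the log with colors, here is a hedged Tkinter sketch of that idea (Tkinter ships with Python on Windows). The level names, the color mapping, and the assumption that the level is the third whitespace-separated field are all made up for illustration.

import tkinter as tk

# Hypothetical mapping from log level to display color.
COLORS = {'ERROR': 'red', 'WARN': 'orange', 'INFO': 'black'}

def show_log(path):
    root = tk.Tk()
    root.title(path)
    text = tk.Text(root, wrap='none', font=('Consolas', 10))
    text.pack(fill='both', expand=True)
    for level, color in COLORS.items():
        text.tag_configure(level, foreground=color)   # one color tag per level
    with open(path) as f:
        for line in f:
            parts = line.split()
            level = parts[2] if len(parts) > 2 else ''  # assumed field position
            text.insert('end', line, level if level in COLORS else ())
    text.configure(state='disabled')  # read-only, like a viewer
    root.mainloop()

show_log('app.log')

A wxPython or PyQt front end would work just as well on Windows; the point is only that a colored log viewer in Python is a small amount of code.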
Duncan
@Duncan, then you should try using Python. With Python you write fewer regexes, because Python's string manipulation capabilities are often more than enough to do the job.
ghostdog74
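As a hedged illustration of that point, the sketch below pulls fields out of an assumed "DATE TIME LEVEL key=value ..." line with plain string methods and no regex at all; the field names and layout are invented.

def parse_line(line):
    """Split an assumed 'DATE TIME LEVEL key=value ...' line without regexes."""
    date, time, level, rest = line.split(None, 3)
    fields = {'date': date, 'time': time, 'level': level}
    for pair in rest.split():
        if '=' in pair:
            key, _, value = pair.partition('=')
            fields[key] = value
    return fields

print(parse_line('2010-04-12 13:45:01 ERROR code=500 user=henry'))
# {'date': '2010-04-12', 'time': '13:45:01', 'level': 'ERROR', 'code': '500', 'user': 'henry'}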
@Duncan - Write Python and you'll never go back. I was a die-hard Perl fan but what ghostdog74 said is indeed true.
j0rd4n
@ghostdog74, @j0rd4n: Perhaps I am too old and set in my ways ... or, more accurately, the vast bulk of the code I have to maintain is written in Perl and none of it in Python, so there is no compelling reason to learn it as yet. Although I'm sure it would help with understanding some of the later O'Reilly books :)
Duncan
@Duncan - Totally understand. :)
j0rd4n
+2  A: 

A finished product such as MS LogParser (usage podcast here) may do what you need, and it's free.

Lucero
I definitely would recommend looking at existing free or commercial products to solve the problem, no need to reinvent the wheel. Splunk is a popular log parsing and analysis tool that can accept arbitrary input: http://www.splunk.com/base/Documentation/latest/Admin/WhatSplunkCanMonitor
Greg Bray
Also Apache Chainsaw.
Billy ONeal
A: 

C/C++ or Java... For C/C++ I have a snippet that might help you:

FILE *f = fopen(file, "rb");
if (f == NULL) {
    return DBDEMON_OPEN_ERROR; // open failed
}

// Keep reading records until fscanf can no longer match a complete line.
// (Checking the fscanf return value avoids the classic feof() off-by-one.)
for (int i = 0; fscanf(f, "%u %s %s %c\n",
                       &db[i].id, &db[i].name[0], &db[i].uid[0],
                       &db[i].priviledge) == 4; i++) {
    db_size++;
}

fclose(f);

This reads a file with the following format:

int string string char
1 SOMETHING ANYTHING Z

into a struct defined as follows:

typedef struct {
    unsigned int id;
    char         name[DBDEMON_NAME_MAXSIZE];
    char         uid[DBDEMON_UID_MAXSIZE];
    char         priviledge;
} DATABASE;

Use fscanf with care: since the arguments are not type-checked against the format string, it can result in errors. But I think this is pretty efficient.

luis
dude... you forgot some spaces on some of those lines...
SeanJA
I am a C/C++ advocate -- and even I'd not call them great languages for text processing.
Billy ONeal
@Billy - So, C++ doesn't process text well? Would that be balanced out by how it can easily create a Windows GUI, or not?
HenryAdamsJr
No, it does not. Strings are not a native type in C++, and the language does not have built-in constructs like regular expressions, `starts_with`, case-insensitive comparisons, substring, trimming, splitting, etc. Almost every other language includes these as part of the language.
Billy ONeal
luis
+7  A: 

You should use the language that YOU know... Unless you have so much time available to complete the project that you can also spend the time learning a new language.

David
This is ALWAYS the correct answer when the question is "What language should I use to do X?" Even if the language isn't great for what you're doing, if you don't know a better one you're better off sticking to what you know for serious projects.
Billy ONeal
That's a great suggestion, and if this was needed within a certain timeframe, I would agree, but I was going to use this project as an excuse to learn something new. Reading all of the answers makes it seem like the language isn't going to make this faster or slower to a great extent. I'm currently leaning towards C++ since I know I can create a Windows GUI with it, and I want to add it to my repertoire.
HenryAdamsJr
+4  A: 

Whatever language your coworker used.

(I could tell you that any macro assembler will let you write code that would rip through your data, but seriously, are you going to spend months writing assembly just to save a few seconds of CPU time? Rewriting a program is fun but it's not practical.)

Whip out your profiler, point it at your horribly performing log parser, and fix the performance problems. If it's a common language, there will be people here who can help.

Ken
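If the replacement does end up being written in Python, profiling it is a few lines with the standard library. The `myparser` module, `parse_log` function, and `app.log` file below are placeholders, not anything from the original parser.

import cProfile
import pstats

from myparser import parse_log  # hypothetical module under test

cProfile.run("list(parse_log('app.log'))", 'parse.prof')
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(20)  # top 20 hot spots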
It wouldn't save a few seconds. If I do it right, it will literally save minutes. With the current implementation, if the file is sufficiently big, it won't return at all. I feel that his implementation is wrong from the ground up, and I don't have access to the source code, anyway.
HenryAdamsJr
+8  A: 

Hmmm, "go with what you know" was a good answer. Perl was designed for this sort of thing (but imo is well suited for simple parsing, but I'd personally avoid it for complex projects).

If it gets even a little complex, why not use a proper syntax and grammar set-up?

Lex & Yacc (or Flex & Bison) spring to mind, but personally I would always reach for Antlr http://www.antlr.org/

Define various "words" in terms of patterns (syntax) and rules to combine those words (grammar), and Antlr will spit out a program to parse your input. You can have the generated program in Java, C, C++ and more (you are worried about parse time, so choose a compiled language, of course).

I personally find it tedious to hand-craft parsers, and even more tedious to debug them, but AntlrWorks is a lovely IDE which really makes it a piece of cake ...

(Screenshot of the AntlrWorks IDE omitted; the pane at the bottom is defining a grammar rule.)

If you mess up your grammar rules, you will be informed. This is not the case with hand-crafted parsers, where you just scratch your body part and wonder about the "strange results"...

Check it out. Even if you think your project is trivial now, it may well grow. And if you have any interest in parsing, you owe it to yourself to at least be familiar with lex/yacc, and especially Antlr(Works).

Mawg
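A full generated parser is beyond the scope of a post, but the "patterns for words, rules for combining them" split that this answer describes can be sketched in plain Python with the standard re module. This is only an illustration of the idea, not ANTLR output; the token names and the sample line are invented.

import re

# Syntax layer: token patterns ("words").
TOKEN_SPEC = [
    ('TIMESTAMP', r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'),
    ('LEVEL',     r'ERROR|WARN|INFO'),
    ('TEXT',      r'\S+'),
    ('SKIP',      r'\s+'),
]
MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(line):
    """Grammar layer: a line must start TIMESTAMP LEVEL, then any number of TEXT."""
    tokens = [(m.lastgroup, m.group()) for m in MASTER_RE.finditer(line)
              if m.lastgroup != 'SKIP']
    if len(tokens) < 2 or tokens[0][0] != 'TIMESTAMP' or tokens[1][0] != 'LEVEL':
        raise ValueError('line does not match grammar: %r' % line)
    return tokens

print(tokenize('2010-04-12 13:45:01 ERROR disk full'))

With a real ANTLR (or lex/yacc) grammar the same two layers exist, but the tool also generates the error reporting the answer mentions.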
I will definitely look into this. It seems that ANTLR would be extremely helpful no matter what language I use.
HenryAdamsJr
A: 

I'd suggest Perl. It was practically built for parsing log files. As for output, I agree with ghostdog74: HTML is the way to go. Perl has dozens of modules that allow you to build and/or template HTML.

I'd parse out the data using regular expressions, then use Template::Toolkit (on CPAN) to create nice pages using HTML and CSS templates.

Matthew S
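Template::Toolkit is Perl-specific; purely to keep the sketches in this thread in one language, here is a hedged Python version of the same parse-then-render-HTML idea, using string.Template from the standard library. The CSS class names and the sample record are made up, and real code would also HTML-escape the message field.

from string import Template

ROW = Template('<tr class="$level"><td>$date</td><td>$level</td><td>$message</td></tr>')

def render_html(records, out_path):
    """Write parsed log records as an HTML table, one CSS class per log level."""
    with open(out_path, 'w') as out:
        out.write('<html><head><style>'
                  'tr.ERROR { color: red; } tr.WARN { color: orange; }'
                  '</style></head><body><table>\n')
        for rec in records:
            out.write(ROW.substitute(rec) + '\n')
        out.write('</table></body></html>\n')

# Example with a single hand-made record, as a hypothetical parse_log() might yield.
render_html([{'date': '2010-04-12', 'level': 'ERROR', 'message': 'disk full'}], 'log.html')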
+1  A: 

"Parse this massive amount of text in the shortest time possible."

Consider the PADS Project from AT&T. It's a special-purpose language, compatible with C, that's designed exactly for high-speed parsing of log files and other ad hoc data formats. There's even a feature where it can try to learn your log format from examples, although I don't know if that has hit production yet. The people behind the project are really smart, and it's had a big impact within the phone company. PADS gives very high performance on data streams that produce gigabytes. Joe Bob says check it out.

If "massive text in the shortest time possible", Perl and Python are not the answer. But if you need to whip up something not too slow, and it's OK to take longer, Perl and Python could be OK. Tems of megabytes is not actually that big.

Norman Ramsey