I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Each log line is of the form

cust_name time_start time_end (IP or URL )*

So: a customer name, two times, and a possibly empty list of IP addresses or URLs. If the list has only one IP or URL, there is no separator. If there is more than one, they are separated by semicolons.

I need a way to parse this line and read it into a data structure. time_start or time_end could be either system time or GMT. cust_name could also have multiple strings separated by spaces.

I can do this by reading character by character and essentially writing my own parser. Is there a better way to do this?

+1  A: 

Custom input demands a custom parser, especially if you want efficiency. Or pray that this is an ideal world where errors don't exist. Posting some code may help.

dirkgently
A: 

You could try a simple lex/yacc or flex/bison grammar to parse this kind of input.

Pierre
+1  A: 

Consider using a Regular Expressions library...

Andrew Flanagan
And next thing you know, we have **another** how do I parse URLs question.
dirkgently
+4  A: 

Why do you want to do this in C++? It sounds like an obvious job for something like perl.

anon
Sure. If he's just doing this job. But the context might be an existing code with some other primary task...
dmckee
He's interested in performance, and a custom C++ parser will blow the doors off a Perl parser for speed of execution (but *not* speed of development).
David Thornley
David, that's not necessarily true. It can very easily backfire on him (in terms of performance) if he stores the resulting gigantic data structure in memory! C++ won't help there.
hasen j
@david untrue - the regex engine in perl has had untold man years spent on it - you are very unlikely to do as good a job with hand-rolled C++ code
anon
I am using C++ because this is part of a full application where the data structures I create are used by the rest of the app.
duli
+6  A: 

Maybe Boost RegExp lib will help you. http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html

bb
I up-modded, but remember, "Those who attempt to solve a problem using regular expressions now have two problems."
Matt Cruikshank
:) Nice quote. But anyway, RegExp is a good solution for small or non-critical tasks.
bb
+5  A: 

I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.

Kristo
+1  A: 

For such a simple grammar you can use split. Take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
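
A rough sketch of the same idea using only the standard library instead of boost::split (a swapped-in technique, not the Boost call itself). It assumes the trailing list has already been isolated as one string; split_hosts is a made-up name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split the trailing "IP or URL" field on semicolons.
// An empty field yields an empty vector; a field with no ';' yields one entry.
// split_hosts is a hypothetical helper name, not part of any library.
std::vector<std::string> split_hosts(const std::string& field) {
    std::vector<std::string> hosts;
    std::istringstream in(field);
    std::string item;
    while (std::getline(in, item, ';')) {
        if (!item.empty()) {  // skip empty pieces such as "a;;b" or a trailing ';'
            hosts.push_back(item);
        }
    }
    return hosts;
}
```

Calling split_hosts on "10.0.0.1;http://example.com" would give two entries, and on a single address it gives one, which matches the "no separator if only one" rule in the question.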

+3  A: 

Using regular expressions (boost::regex is a nice implementation for C++), you can easily separate the different parts of your string (cust_name, time_start, ...) and find all the URLs and IPs.

The second step is more detailed parsing of those groups, if needed. Dates, for example, can be parsed with the Boost.Date_Time library (or with a custom parser if the string format isn't standard).
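
As an illustration of the first step, here is a sketch using std::regex from the standard library rather than boost::regex (the two have similar interfaces, but this is a stand-in, not the Boost API). The timestamp shape is an assumption, since the question does not give one: it is taken to look like 2009-03-20T10:15:00 with an optional trailing Z. LogFields and match_line are hypothetical names.

```cpp
#include <regex>
#include <string>

// Hypothetical record type; the field names are illustrative only.
struct LogFields {
    std::string cust_name;   // may contain spaces
    std::string time_start;
    std::string time_end;
    std::string hosts;       // raw "a;b;c" list, empty when absent
};

// Assumption: both times look like 2009-03-20T10:15:00 with an optional 'Z'.
// Returns false when the line does not fit that shape.
bool match_line(const std::string& line, LogFields& out) {
    static const std::regex pattern(
        R"re(^(.+?)\s+(\d{4}-\d\d-\d\dT\d\d:\d\d:\d\dZ?))re"
        R"re(\s+(\d{4}-\d\d-\d\dT\d\d:\d\d:\d\dZ?)(?:\s+(\S+))?\s*$)re");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return false;
    }
    // An unmatched optional group converts to an empty string.
    out = LogFields{m[1].str(), m[2].str(), m[3].str(), m[4].str()};
    return true;
}
```

The lazy first group lets a multi-word cust_name stop at the first timestamp. Whether a regex is fast enough for a huge log is something to measure rather than assume.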

begray
+1  A: 

UPDATE changed answer drastically!

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Just be aware that C++ won't help much in terms of efficiency in this situation. Don't be fooled into thinking that your program will perform well just because the parsing code is fast C++!

The efficiency you really need here is not the performance at the "machine code" level of the parsing code, but at the overall algorithm level.

Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure.

Storing a huge data structure in memory is very inefficient, no matter what language you're using!

What you need to do is "fetch" one line at a time, convert it to a data structure, and deal with it. Only after you're done with that data structure do you fetch the next line, convert it, deal with it, and repeat.

If you do that, you've already solved the major bottleneck.

As for parsing the line of text, the format of your data seems quite simple. Check out a similar question that I asked a while ago: http://stackoverflow.com/questions/536148/c-string-parsing-python-style

In your case, I suppose you could use a string stream, and use the >> operator to read the next "thing" in the line.

See this answer for example code.
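
Roughly, the >> approach could look like the sketch below. It assumes cust_name is a single word and each time is one whitespace-free token. Both are simplifying assumptions: the question says names can contain spaces, which this version does not handle. Record and parse_simple are made-up names.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type; the field names are illustrative only.
struct Record {
    std::string cust_name;
    std::string time_start;
    std::string time_end;
    std::string hosts;  // raw "a;b;c" list, empty when absent
};

// Assumption: cust_name is one word and each time is one token.
// Returns false if fewer than three tokens are present.
bool parse_simple(const std::string& line, Record& out) {
    std::istringstream in(line);
    if (!(in >> out.cust_name >> out.time_start >> out.time_end)) {
        return false;
    }
    out.hosts.clear();
    in >> out.hosts;  // stays empty when there is no trailing list
    return true;
}
```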

Alternatively (I didn't want to delete this part!!), if you could write this in Python it would be much simpler. I don't know your situation (it seems you're stuck with C++), but still:

Look at this presentation on doing these kinds of tasks efficiently using Python generator expressions: http://www.dabeaz.com/generators/Generators.pdf

It's a worthwhile read. At slide 31 he deals with what seems to be something very similar to what you're trying to do.

It'll at least give you some inspiration.
It also demonstrates quite strongly that performance is gained not from the particular string-parsing code, but from the overall algorithm.

hasen j
I think you are conflating a good idea (process one line at a time) with one that depends on the context (don't use C++ for this). Moreover, the OP notes in the comments to another answer that he's doing this in existing C++ code. Nonetheless, +1 for the one-at-a-time point.
dmckee
Good point! I changed the answer. In my defense, though, he mentioned the existing C++ app quite a while after I posted my answer.
hasen j
A: 

The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's an issue of what data structure you build & save.
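
To give a feel for how small such a parser can be, here is a hand-written sketch. It rests on an assumption the question does not confirm: both times are all-digit tokens (for example epoch seconds), which is what lets the code tell "no list" from "list present" by looking at the last token. ParsedLine, all_digits and parse_by_hand are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type; the field names are illustrative only.
struct ParsedLine {
    std::string cust_name, time_start, time_end, hosts;
};

static bool all_digits(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Assumption: times are all-digit tokens; an IP or URL list never is.
bool parse_by_hand(const std::string& line, ParsedLine& out) {
    // One pass over the characters: record [begin, end) of every token.
    std::vector<std::pair<std::size_t, std::size_t>> tok;
    const std::size_t n = line.size();
    std::size_t i = 0;
    while (i < n) {
        while (i < n && std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        const std::size_t begin = i;
        while (i < n && !std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        if (i > begin) tok.emplace_back(begin, i);
    }
    if (tok.size() < 3) return false;

    auto text = [&](std::size_t k) {
        return line.substr(tok[k].first, tok[k].second - tok[k].first);
    };

    // Work from the right: optional list, then time_end, then time_start.
    const std::size_t last = tok.size() - 1;
    const bool has_hosts = !all_digits(text(last));
    const std::size_t end_idx = has_hosts ? last - 1 : last;
    if (end_idx < 2) return false;  // need a name plus two times

    out.time_end = text(end_idx);
    out.time_start = text(end_idx - 1);
    if (!all_digits(out.time_start) || !all_digits(out.time_end)) return false;

    // Everything before time_start is the (possibly multi-word) name.
    out.cust_name = line.substr(tok[0].first,
                                tok[end_idx - 2].second - tok[0].first);
    out.hosts = has_hosts ? text(last) : std::string();
    return true;
}
```

Scanning from the right sidesteps the multi-word cust_name problem, because the name is simply whatever is left over once the times and the list have been taken.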

Mike Dunlavey