views: 83

answers: 2

Heyho, I've got a big bunch of log lines (more or less undocumented) and need to parse them. The parsing itself won't be a big problem, but first I need to know how many different kinds of lines are in the files. The lines vary a lot: some are short errors, while others are longer lines that differ only in a few fields, like the full username from the certificate being used, or numbers such as transfer size and time.

It would be nice to get a generated pattern showing the common parts and the differences within a group of similar lines.

Are there any tools around that can analyse a big batch of input and output the parts all the lines have in common?

Thanks :).

A: 

I don't know of any such tool. I'd probably just open up the file, sort it, and delete duplicate types of messages.

For example, if you had:

Error while writing char 45
Error while writing char 8
Error while writing char 903

I'd reduce it to

Error while writing char #
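
If you want to automate that kind of reduction, here's a minimal sketch (not an existing tool; it assumes Python and that the variable fields are mostly numeric) that collapses runs of digits to # and counts the distinct templates read from standard input:

import re
import sys
from collections import Counter

def template(line: str) -> str:
    # Collapse runs of digits to '#' so lines that differ only
    # in numeric fields reduce to the same pattern.
    return re.sub(r"\d+", "#", line.strip())

counts = Counter(template(line) for line in sys.stdin if line.strip())

# Print each distinct template with its frequency, most common first.
for tmpl, n in counts.most_common():
    print(f"{n:6d}  {tmpl}")

Fed the three example lines above, this prints a single template, "Error while writing char #", with a count of 3. Non-numeric variable fields like usernames would need their own normalization rules.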

I'm not sure the tool you're requesting is feasible. Consider these error messages:

I/O Error: couldn't open file abc.txt
I/O Error: failed while writing to xyz.txt
Database Error: couldn't open database MyDB

What algorithm could tell you that the 2nd error is a variation on the 1st, but the 3rd error is a new type?

I think you'll have to do it manually, but sorting will make it easier.

Jeremy Stein
Nope, the fields have different lengths, different values, and so on; there are no duplicates - every line is unique :/.
mj
Understood. I meant duplicate types. I'll update my answer.
Jeremy Stein
A: 

I can't think of a way to write this in a regular expression.

However, what about copying and pasting the logs into Excel and then sorting them? That should make it easier to identify how many different types of messages there are.

Or you could import the logs into Access or a SQL database and then use SELECT DISTINCT to trim the results down even further.
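
As a rough sketch of that route (using Python's built-in sqlite3 module as a stand-in for Access, and a hypothetical app.log file name), you could do:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (line TEXT)")

# "app.log" is a placeholder file name for illustration.
with open("app.log") as f:
    conn.executemany("INSERT INTO log (line) VALUES (?)",
                     ((line.rstrip("\n"),) for line in f))

# SELECT DISTINCT trims the results down to unique lines.
for (line,) in conn.execute("SELECT DISTINCT line FROM log ORDER BY line"):
    print(line)

Note that if every raw line really is unique, as the comment above says, DISTINCT on its own won't shrink anything; you'd want to normalize the variable fields first, for example with the digit-collapsing trick from the other answer.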

Steve Wortham