I'm combing a webapp's log file for statements that stand out.

Most of the lines are similar and uninteresting. I'd normally pass them through Unix uniq, but that filters out nothing here, because every line differs slightly: they all carry a different timestamp, similar statements might print a different user ID, and so on.

What's a way and/or a tool to get just the lines that are notably different from all the others? (But, again, not merely lines that aren't exact duplicates.)

I was thinking about playing with Python's difflib but that seems geared toward diffing two files, rather than all pairs of lines in the same file.

[EDIT]

I assumed the solution would give each line a uniqueness score. So by "notably different" I meant that I'd choose a threshold which a line's uniqueness score must exceed for it to be included in the output.

That said, if there are other viable ways to define it, please discuss. Also, the method doesn't have to achieve 100% precision and recall.

[/EDIT]

Examples:

I'd prefer answers that are as general-purpose as possible. I know I can strip away the timestamp at the beginning. Stripping the end is more challenging, since its wording may be completely unlike anything else in the file. These sorts of details are why I shied away from concrete examples before, but since some people asked...

Similar 1:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:04:02 INFO  com.foo.Bar - URL:/graph?id=asdfghjk

Similar 2:

2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
2009-04-20 00:06:00 INFO  com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses

Different 1:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses

In the Different 1 case, I'd like both lines returned, but not other lines like them. In other words, those two lines are distinct types (so I can later ask for only the statistically rare line types). For one thing, the edit distance between those two is much larger.

+4  A: 

Define "notably different". Then have a look at "edit distance" measures.

Charlie Martin
A good tool here, but he will have to decide edit distance from *what*. All lines against each other gets to be a big problem fast...
dmckee
@dmckee: performance should not really be an issue. Edit distance computations are extremely fast. See my note here: http://stackoverflow.com/questions/54797/how-do-you-implement-levenshtein-distance-in-delphi
JosephStyons
Performance can still be an issue -- all lines against each other is O(N^2), which can be a lot of comparisons if you're looking at a million-line logfile (as in, that's 10^12 distance calculations).
Rick Copeland
So, "How big is the file?" matters. Ten thousand lines is no problem: some arbitrary guess and a BOTE calculation gives a couple of minutes. Machs nicht. A lot more lines and it starts to add up...
dmckee
I don't think you really need to look at the whole quadratic set of lines, either. Log files are time-ordered and formatted; it's easy to skip comparisons between lines that are far apart in time.
Charlie Martin
Tell you what: why don't you show us some example lines that you think are similar, and notably different.
Charlie Martin
+2  A: 

You could try a bit of code that counts word frequencies across the file and then sorts lines by how uncommon their words are.

If that doesn't do the trick, you can add in some smarts to filter out time stamps and numbers.

Your problem is similar to an earlier question on generating summaries of news stories.
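
A rough sketch of that idea in Python; the 26-character offset that skips the timestamp/level prefix and the keep parameter are assumptions based on the example lines in the question, not part of this answer:

import re
from collections import Counter

def rarest_lines(lines, keep=20, offset=26):
    # Count how often each word occurs across the whole file,
    # ignoring the leading timestamp/level prefix.
    counts = Counter(w for line in lines
                       for w in re.findall(r'\w+', line[offset:]))
    def avg_frequency(line):
        words = re.findall(r'\w+', line[offset:])
        return float(sum(counts[w] for w in words)) / max(len(words), 1)
    # Lines built from the least common words sort first.
    return sorted(lines, key=avg_frequency)[:keep]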

RossFabricant
+1, novel approach.
j_random_hacker
This is what I used, with the structure of @dmckee's solution. Very simple and effective!
Bluu
+2  A: 

I don't know a tool for you but if I were going to roll my own, I'd approach it like this:

Presumably the log lines have a well-defined structure, no? So

  • parse the lines on that structure
  • write a number of very basic relevance filters (functions that just return a simple number from the parsed structure)
  • run the parsed lines through a set of filters, and cut on the basis of the total score
  • possibly sort the remaining lines into various bins by the results of more filters
  • generate reports, dump bins to files, or other output

If you are familiar with the unix tool procmail, I'm suggesting a similar treatment customized for your data.


As zacherates notes in the comments, your filters will typically ignore time stamps (and possibly IP addresses) and concentrate on the content: for example, really long HTTP requests might represent an attack... or whatever applies to your domain.

Your binning filters might be as simple as a hash on a few selected fields, or you might try to do something with Charlie Martin's suggestion and use edit distance measures.
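
A minimal sketch of that parse / score / bin flow in Python; the regular expression, the specific filter functions, and the threshold are all illustrative assumptions, not part of this answer:

import re
from collections import defaultdict

# Hypothetical pattern matching the example lines in the question.
LINE_RE = re.compile(r'^(?P<ts>\S+ \S+) (?P<level>\w+)\s+(?P<logger>\S+) - (?P<msg>.*)$')

def parse(line):
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

# Very basic relevance filters: each just returns a number from the parsed structure.
def msg_length(rec):
    return len(rec['msg'])

def severity(rec):
    return 10 if rec['level'] in ('WARN', 'ERROR') else 0

FILTERS = [msg_length, severity]

def interesting(lines, threshold=80):
    bins = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue  # skip lines that don't match the expected structure
        if sum(f(rec) for f in FILTERS) >= threshold:
            # Bin by logger so similar "types" of lines end up together.
            bins[rec['logger']].append(line)
    return bins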

dmckee
This approach also allows you to determine uniqueness based on the structure of the log line (so you can ignore ip address, time stamp, etc. when batching lines).
Aaron Maenpaa
I love how general this is. This basic approach will work whether I'm working on a highly predictable, structured log file or dealing with natural language. In the case of my log, I'm finding @rossfabricant's relevance filter of uncommon words quick, dirty, and very helpful.
Bluu
"I love how general this is." Me too. That why I stole it. All admiration should filter back to the procmail guys, or whoever they got it from.
dmckee
A: 

I wonder if you could just focus on the part that defines uniqueness for you. In this case, it seems that the part defining uniqueness is just the middle part:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
                    ^---------------------^ 

2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
                    ^--------------------------------^

I would then compare exactly this part, perhaps using a regular expression (capturing just the parenthesized group; how to access sub-matches like this is language-dependent):

/^.{20}(\w+\s+[\w\.-]+\s+-\s+\w+)/
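
In Python, for example, extracting that captured group could look like this (just an illustrative use of the regex above; the helper name is made up):

import re

pattern = re.compile(r'^.{20}(\w+\s+[\w\.-]+\s+-\s+\w+)')

def line_key(line):
    # Return the part of the line that defines uniqueness, or None if it doesn't match.
    m = pattern.match(line)
    return m.group(1) if m else None

print(line_key('2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234'))
# prints: INFO  com.foo.Bar - URL
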
Svante
A: 
ja
+1  A: 

Perhaps you could do a basic calculation of "words the same"/"all words"?

e.g. (including an offset to allow you to ignore the timestamp and the word 'INFO', if that's always the same):

import re

def score(s1, s2, offset=26):
    # Word sets of each line, skipping the timestamp/level prefix.
    words1 = set(re.findall(r'\w+', s1[offset:]))
    words2 = set(re.findall(r'\w+', s2[offset:]))
    # Fraction of shared words, relative to the larger set.
    return float(len(words1 & words2)) / max(len(words1), len(words2))

Given:

>>> s1
'2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234'
>>> s2
'2009-04-20 00:04:02 INFO  com.foo.Bar - URL:/graph?id=asdfghjk'
>>> s3
'2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses'
>>> s4
'2009-04-20 00:06:00 INFO  com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses'

This yields:

>>> score(s1,s2)
0.8571428571428571
>>> score(s3,s4)
0.75
>>> score(s1,s3)
0.066666666666666666

You've still got to decide which lines to compare. Also the use of set() may distort the scores slightly – the price of a simple algorithm :-)
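
One possible way to apply that score to the original problem (the threshold and the "compare only against lines already kept" strategy are my assumptions, not part of this answer):

def unique_lines(lines, threshold=0.4):
    # Keep a line only if it isn't too similar to any line kept so far
    # (uses score() from above).
    kept = []
    for line in lines:
        if all(score(line, k) < threshold for k in kept):
            kept.append(line)
    return kept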

John Fouhy