I'm combing a webapp's log file for statements that stand out.

Most of the lines are similar and uninteresting. I'd normally pass them through Unix uniq, but that filters out nothing here, because every line differs slightly: they all carry a different timestamp, similar statements might print a different user ID, and so on.

What's a way and/or a tool to get just the lines that are notably different from all the others? (But, again, not merely lines that aren't exact duplicates.)

I was thinking about playing with Python's difflib but that seems geared toward diffing two files, rather than all pairs of lines in the same file.

[EDIT]

I assumed the solution would give each line a uniqueness score. So by "notably different" I meant that I'd choose a threshold which a line's uniqueness score must exceed for it to be included in the output.

That said, if there are other viable ways to define it, please discuss. Also, the method doesn't have to achieve 100% precision and recall.

[/EDIT]

Examples:

I'd prefer answers that are as general-purpose as possible. I know I can strip away the timestamp at the beginning. Stripping the end is more challenging, since its wording may be completely unlike anything else in the file. These sorts of details are why I shied away from concrete examples before, but since some people asked...

Similar 1:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:04:02 INFO  com.foo.Bar - URL:/graph?id=asdfghjk

Similar 2:

2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
2009-04-20 00:06:00 INFO  com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses

Different 1:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses

In the Different 1 case, I'd like both lines returned, but not other lines like them. In other words, those two lines are distinct types (so I can later ask for only the statistically rare line types). For one thing, the edit distance between those two is much larger.

+4  A: 

Define "notably different". Then have a look at "edit distance" measures.

Charlie Martin
A good tool here, but he will have to decide edit distance from *what*. All lines against each other gets to be a big problem fast...
dmckee
@dmckee: performance should not really be an issue. Edit distance computations are extremely fast. See my note here: http://stackoverflow.com/questions/54797/how-do-you-implement-levenshtein-distance-in-delphi
JosephStyons
Performance can still be an issue -- all lines against each other is O(N^2), which can be a lot of comparisons if you're looking at a million-line logfile (as in, that's 10^12 distance calculations).
Rick Copeland
So, "How big is the file?" matters. Ten thousand lines is no problem: some arbitrary guess and a BOTE calculation gives a couple of minutes. Machs nicht. A lot more lines and it starts to add up...
dmckee
I don't think you really need to look at the whole quadratic set of lines, either. Log files are time-ordered and formatted; it's easy to skip comparisons between lines that are far apart in time.
Charlie Martin
Tell you what: why don't you show us some example lines that you think are similar, and notably different.
Charlie Martin
+2  A: 

You could try a bit of code that counts word frequencies across the file and then sorts lines by how uncommon their words are.

If that doesn't do the trick, you can add in some smarts to filter out time stamps and numbers.

Your problem is similar to an earlier question on generating summaries of news stories.
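
A rough sketch of that idea in Python; the 26-character offset that skips the timestamp/level prefix and the keep parameter are assumptions based on the example lines in the question, not part of this answer:

import re
from collections import Counter

def rarest_lines(lines, keep=20, offset=26):
    # Count how often each word occurs across the whole file,
    # ignoring the leading timestamp/level prefix.
    counts = Counter(w for line in lines
                       for w in re.findall(r'\w+', line[offset:]))
    def avg_frequency(line):
        words = re.findall(r'\w+', line[offset:])
        return float(sum(counts[w] for w in words)) / max(len(words), 1)
    # Lines built from the least common words sort first.
    return sorted(lines, key=avg_frequency)[:keep]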

RossFabricant
+1, novel approach.
j_random_hacker
This is what I used, with the structure of @dmckee's solution. Very simple and effective!
Bluu
+2  A: 

I don't know a tool for you but if I were going to roll my own, I'd approach it like this:

Presumably the log lines have a well-defined structure, no? So

  • parse the lines on that structure
  • write a number of very basic relevance filters (functions that just return a simple number from the parsed structure)
  • run the parsed lines through a set of filters, and cut on the basis of the total score
  • possibly sort the remaining lines into various bins by the results of more filters
  • generate reports, dump bins to files, or other output

If you are familiar with the unix tool procmail, I'm suggesting a similar treatment customized for your data.


As zacherates notes in the comments, your filters will typically ignore time stamps (and possibly IP addresses) and concentrate on the content: for example, really long HTTP requests might represent an attack... or whatever applies to your domain.

Your binning filters might be as simple as a hash on a few selected fields, or you might try to do something with Charlie Martin's suggestion and use edit distance measures.
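
A minimal sketch of that parse / score / bin flow in Python; the regular expression, the specific filter functions, and the threshold are all illustrative assumptions, not part of this answer:

import re
from collections import defaultdict

# Hypothetical pattern matching the example lines in the question.
LINE_RE = re.compile(r'^(?P<ts>\S+ \S+) (?P<level>\w+)\s+(?P<logger>\S+) - (?P<msg>.*)$')

def parse(line):
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

# Very basic relevance filters: each just returns a number from the parsed structure.
def msg_length(rec):
    return len(rec['msg'])

def severity(rec):
    return 10 if rec['level'] in ('WARN', 'ERROR') else 0

FILTERS = [msg_length, severity]

def interesting(lines, threshold=80):
    bins = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue  # skip lines that don't match the expected structure
        if sum(f(rec) for f in FILTERS) >= threshold:
            # Bin by logger so similar "types" of lines end up together.
            bins[rec['logger']].append(line)
    return bins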

dmckee
This approach also allows you to determine uniqueness based on the structure of the log line (so you can ignore ip address, time stamp, etc. when batching lines).
Aaron Maenpaa
I love how general this is. This basic approach will work whether I'm working on a highly predictable, structured log file or dealing with natural language. In the case of my log, I'm finding @rossfabricant's relevance filter of uncommon words quick, dirty, and very helpful.
Bluu
"I love how general this is." Me too. That why I stole it. All admiration should filter back to the procmail guys, or whoever they got it from.
dmckee
A: 

I wonder if you could just focus on the part that defines uniqueness for you. In this case, it seems that the part defining uniqueness is just the middle part:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
                    ^---------------------^ 

2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
                    ^--------------------------------^

I would then compare exactly this part, perhaps using a regular expression (capturing just the parenthesized group; how to access sub-matches like this is language-dependent):

/^.{20}(\w+\s+[\w\.-]+\s+-\s+\w+)/
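
In Python, for example, extracting that captured group could look like this (just an illustrative use of the regex above; the helper name is made up):

import re

pattern = re.compile(r'^.{20}(\w+\s+[\w\.-]+\s+-\s+\w+)')

def line_key(line):
    # Return the part of the line that defines uniqueness, or None if it doesn't match.
    m = pattern.match(line)
    return m.group(1) if m else None

print(line_key('2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234'))
# prints: INFO  com.foo.Bar - URL
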
Svante
A: 
ja
+1  A: 

Perhaps you could do a basic calculation of "words the same"/"all words"?

e.g. (including an offset to allow you to ignore the timestamp and the word 'INFO', if that's always the same):

import re

def score(s1, s2, offset=26):
    # Word sets of each line, skipping the timestamp/level prefix.
    words1 = set(re.findall(r'\w+', s1[offset:]))
    words2 = set(re.findall(r'\w+', s2[offset:]))
    # Fraction of shared words, relative to the larger set.
    return float(len(words1 & words2)) / max(len(words1), len(words2))

Given:

>>> s1
'2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234'
>>> s2
'2009-04-20 00:04:02 INFO  com.foo.Bar - URL:/graph?id=asdfghjk'
>>> s3
'2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses'
>>> s4
'2009-04-20 00:06:00 INFO  com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses'

This yields:

>>> score(s1,s2)
0.8571428571428571
>>> score(s3,s4)
0.75
>>> score(s1,s3)
0.066666666666666666

You've still got to decide which lines to compare. Also the use of set() may distort the scores slightly – the price of a simple algorithm :-)
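
One possible way to apply that score to the original problem (the threshold and the "compare only against lines already kept" strategy are my assumptions, not part of this answer):

def unique_lines(lines, threshold=0.4):
    # Keep a line only if it isn't too similar to any line kept so far
    # (uses score() from above).
    kept = []
    for line in lines:
        if all(score(line, k) < threshold for k in kept):
            kept.append(line)
    return kept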

John Fouhy