I'm combing a webapp's log file for statements that stand out.
Most of the lines are similar and uninteresting. I'd pass them through Unix uniq, but that filters nothing because all the lines are slightly different: each has a different timestamp, similar statements might print a different user ID, etc.
What's a way or tool to get just the lines that are notably different from every other line? (Again, removing exact duplicates isn't enough.)
I was thinking about playing with Python's difflib, but that seems geared toward diffing two files rather than comparing all pairs of lines in the same file.
[EDIT]
I assumed the solution would give a uniqueness score for each line. So by "notably different" I meant that I choose a threshold, and a line is included in the output only if its uniqueness score exceeds that threshold.
That said, if there are other viable ways to define it, please discuss. Also, the method doesn't need 100% accuracy and recall.
[/EDIT]
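To make the scoring idea concrete, here is a rough sketch of what I have in mind. It is only an illustration, not a solution I'm happy with: it uses difflib's SequenceMatcher on every pair of lines, so it is quadratic in the number of lines and probably too slow for a real log. The function names and the 0.5 threshold are made up for this example.

```python
from difflib import SequenceMatcher

def uniqueness_scores(lines):
    """Score each line as 1 minus its best similarity ratio to any other line."""
    scores = []
    for i, line in enumerate(lines):
        best = 0.0
        for j, other in enumerate(lines):
            if i != j:
                # ratio() is in [0, 1]; 1.0 means the two strings are identical
                best = max(best, SequenceMatcher(None, line, other).ratio())
        scores.append(1.0 - best)
    return scores

def notably_different(lines, threshold=0.5):
    """Keep only lines whose uniqueness score exceeds the (arbitrary) threshold."""
    scores = uniqueness_scores(lines)
    return [line for line, score in zip(lines, scores) if score > threshold]
```

With this definition, a line that has even one close neighbour gets a low score, which is roughly the behaviour I want from the threshold.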
Examples:
I'd prefer answers that are as general purpose as possible. I know I can strip away the timestamp at the beginning. Stripping the end is more challenging, as its wording may be completely unlike anything else in the file. These sorts of details are why I shied away from concrete examples before, but since some people asked...
Similar 1:
2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk
Similar 2:
2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
2009-04-20 00:06:00 INFO com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses
Different 1:
2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
In the Different 1 case, I'd like both lines returned but not other lines like them. In other words, those two lines are distinct types (and I can later ask for only the statistically rare line types). The edit distance between those two is much bigger, for one thing.
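As a sketch of the "line types" idea, assuming a simple greedy grouping is acceptable: each line joins the first type whose representative it resembles closely enough, otherwise it starts a new type. The helper names and the 0.6 similarity cutoff are my own guesses for illustration, and the result depends on line order, so I'd welcome something more principled.

```python
from collections import Counter
from difflib import SequenceMatcher

def group_line_types(lines, min_ratio=0.6):
    """Greedily assign each line to the first type whose representative it resembles."""
    representatives = []  # first line seen for each type
    counts = Counter()    # representative index -> number of lines of that type
    for line in lines:
        for idx, rep in enumerate(representatives):
            if SequenceMatcher(None, rep, line).ratio() >= min_ratio:
                counts[idx] += 1
                break
        else:
            # No existing type was close enough, so this line starts a new type.
            representatives.append(line)
            counts[len(representatives) - 1] = 1
    return [(rep, counts[i]) for i, rep in enumerate(representatives)]

def rare_types(lines, max_count=1, min_ratio=0.6):
    """Return one representative line per type that occurs at most max_count times."""
    return [rep for rep, n in group_line_types(lines, min_ratio) if n <= max_count]
```

On the four sample lines above, this should yield two types (the URL lines and the cache lines), and asking for rare types afterwards is just a filter on the counts.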