I have some HUGE log files (50 MB; ~500K lines) that I need to start filtering some of the crap out of. The log files are produced using log4j and have the basic pattern of:

[log-level] date-time class etc, etc  
log-message  

I'm looking for a way to specify a start regex and an end regex (or something similar) that will filter the matching entries out of the file so I can more easily wade through these massive files. My thought is that the start regex would match the log-level and the end regex would match something in the log-message. I'm sure I could write a Java program to accomplish this, but I thought I'd ask the community before going down that path. Thanks in advance.


Let me expand on my question. Let's assume I have the following snippet in my log file:

[DEBUG] date-time class etc, etc  
log-message-1

[WARN] date-time class etc, etc  
log-message-2

[DEBUG] date-time class etc, etc  
log-message-3

[DEBUG] date-time class etc, etc  
log-message-1

[WARN] date-time class etc, etc  
log-message-2

[DEBUG] date-time class etc, etc  
log-message-6

I'd like a way to filter out logEntry1 and logEntry2 (the entries containing log-message-1 and log-message-2) so I end up with:

[DEBUG] date-time class etc, etc  
log-message-3

[DEBUG] date-time class etc, etc  
log-message-6

I would hope to accomplish this by defining some sets of regex pattern pairs. In my example above, I'd want to define one pair for logEntry1 and another for logEntry2.

I hope that helps clarify my question.

+1  A: 
(zyx:~) % echo $T
[DEBUG] date-time class etc, etc  
log-message-1

[WARN] date-time class etc, etc  
log-message-2

[DEBUG] date-time class etc, etc  
log-message-3

[DEBUG] date-time class etc, etc  
log-message-1

[WARN] date-time class etc, etc  
log-message-2

[DEBUG] date-time class etc, etc  
log-message-6
(zyx:~) % echo $T | perl -e '$_=join("", <>); s/\[DEBUG\][^\n]*\n(log-message-1|log-message-2).*?(?=\n\[(DEBUG|WARN)\]|$)//sg; s/\[WARN\].*?(?=\n\[(DEBUG|WARN)\]|$)//sg; print;'


[DEBUG] date-time class etc, etc  
log-message-3



[DEBUG] date-time class etc, etc  
log-message-6
ZyX
No, no, no. Please don't create multi-GB strings in perl with `$_=join("", <>);`
osgx
The author said he has a 50 MiB file. If he had said it was a 2 GiB file, I would have written a different script.
ZyX
A: 

Use awk or awk-styled perl one-liners.

osgx
Sure, sure... assuming I'm an awk or perl expert, which I'm not.
fmpdmb
awk is VERY easy to learn. You need only a little awk to parse such files. perl can be used in the same awk style with easy syntax.
osgx
+3  A: 

Assuming log-message-1 and log-message-2 are unique patterns.

$ awk -vRS= '!/log-message-[12]/' ORS="\n\n" file
[DEBUG] date-time class etc, etc
log-message-3

[DEBUG] date-time class etc, etc
log-message-6
ghostdog74
I'm not sure I understand what this is doing. This doesn't specify the start regex. I did notice that this trimmed out all blank lines from my log.
fmpdmb
I still don't understand what this is doing, but it seems to be working. I believe I can take this snippet, define my set of regexes in a file, read the file, loop over each regex executing your snippet, and I should be there.
fmpdmb
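The "regexes in a file" idea can probably be done in a single pass rather than by looping over the snippet once per regex. A rough sketch, assuming one regex per line in a patterns file and blank-line-separated entries; the file names and the `filtered` variable are made up for the example:

```shell
# Hypothetical patterns file: one regex per line; an entry matching any of them is dropped.
patterns=$(mktemp)
cat > "$patterns" <<'EOF'
log-message-1
^\[WARN\]
EOF

# Hypothetical sample log with blank-line-separated entries.
log=$(mktemp)
cat > "$log" <<'EOF'
[DEBUG] date-time class etc, etc
log-message-1

[WARN] date-time class etc, etc
log-message-2

[DEBUG] date-time class etc, etc
log-message-3

[DEBUG] date-time class etc, etc
log-message-6
EOF

# The patterns file is read line by line with the default RS.
# The RS= and ORS= assignments placed before the log file name switch awk
# to blank-line-separated records for the log only.
filtered=$(awk '
    NR == FNR { if ($0 != "") pats[++n] = $0; next }  # first file: collect regexes
    {
        for (i = 1; i <= n; i++)
            if ($0 ~ pats[i]) next                     # drop entry on first match
        print
    }
' "$patterns" RS= ORS='\n\n' "$log")

printf '%s\n' "$filtered"
rm -f "$patterns" "$log"
```

Note that a pattern such as `log-message-1` would also match `log-message-10`, so real patterns may need to be more specific.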
The command sets the record separator to blank lines, so each block from `[..]` to the next blank line is treated as one record. The pattern then selects the records that DON'T contain `log-message-1` or `log-message-2`, and awk prints them. That's all there is.
ghostdog74
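If a start/end pair is wanted, as in the original question, the same record-per-entry approach can be extended. The sketch below is one possible reading of that idea: the first line of each entry is tested against a start regex and the remaining lines against an end regex, and the entry is dropped only when both match. The variable names `start_re` and `end_re` and the sample data are assumptions for illustration.

```shell
# Hypothetical sample log with blank-line-separated entries.
log=$(mktemp)
cat > "$log" <<'EOF'
[DEBUG] date-time class etc, etc
log-message-1

[WARN] date-time class etc, etc
log-message-1

[DEBUG] date-time class etc, etc
log-message-3
EOF

# Backslashes are doubled because awk processes escape sequences in -v values.
filtered=$(awk -v RS= -v ORS='\n\n' \
               -v start_re='^\\[DEBUG\\]' -v end_re='log-message-1' '
    {
        nl = index($0, "\n")                         # end of the header line
        header = (nl ? substr($0, 1, nl - 1) : $0)   # [log-level] date-time class ...
        body   = (nl ? substr($0, nl + 1) : "")      # the log message line(s)
    }
    !(header ~ start_re && body ~ end_re)            # print unless both regexes match
' "$log")

printf '%s\n' "$filtered"
rm -f "$log"
```

Here the WARN entry survives even though its message matches the end regex, because its header does not match the start regex.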