I'm having a problem parsing Snort logs with the pyparsing module.

The problem is splitting the Snort log (which has multi-line entries separated by a blank line) and getting pyparsing to parse each entry as a whole chunk, rather than reading it in line by line and expecting the grammar to work on each line (obviously, it does not).

I have tried converting each chunk to a temporary string and stripping out the newlines inside each chunk, but it still refuses to parse correctly. I may be wholly on the wrong track, but I don't think so (a similar approach works perfectly for syslog-type logs, but those are one-line entries and so lend themselves to a basic file iterator / line-by-line processing).

Here's a sample of the log and the code I have so far:

[**] [1:486:4] ICMP Destination Unreachable Communication with Destination Host is Administratively Prohibited [**]
[Classification: Misc activity] [Priority: 3] 
08/03-07:30:02.233350 172.143.241.86 -> 63.44.2.33
ICMP TTL:61 TOS:0xC0 ID:49461 IpLen:20 DgmLen:88
Type:3  Code:10  DESTINATION UNREACHABLE: ADMINISTRATIVELY PROHIBITED HOST FILTERED
** ORIGINAL DATAGRAM DUMP:
63.44.2.33:41235 -> 172.143.241.86:4949
TCP TTL:61 TOS:0x0 ID:36212 IpLen:20 DgmLen:60 DF
Seq: 0xF74E606
(32 more bytes of original packet)
** END OF DUMP

[**] ...more like this [**]

And the updated code:

def snort_parse(logfile):
    header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + Suppress("]") + Regex(".*") + Suppress("[**]")
    cls = Optional(Suppress("[Classification:") + Regex(".*") + Suppress("]"))
    pri = Suppress("[Priority:") + integer + Suppress("]")
    date = integer + "/" + integer + "-" + integer + ":" + integer + "." + Suppress(integer)
    src_ip = ip_addr + Suppress("->")
    dest_ip = ip_addr
    extra = Regex(".*")

    bnf = header + cls + pri + date + src_ip + dest_ip + extra

    def logreader(logfile):
        chunk = []
        with open(logfile) as snort_logfile:
            for line in snort_logfile:
                if line !='\n':
                    line = line[:-1]
                    chunk.append(line)
                    continue
                else:
                    print chunk
                    yield " ".join(chunk)
                    chunk = []

    string_to_parse = "".join(logreader(logfile).next())
    fields = bnf.parseString(string_to_parse)
    print fields

Any help, pointers, RTFMs, You're Doing It Wrongs, etc., greatly appreciated.

A: 

Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. I'm unclear as to whether the problem is with pyparsing being unable to handle the entries, or with you being unable to send them to pyparsing in the right format. If the latter, why not do something like this?

def logreader( path_to_file ):
    chunk = [ ]
    with open( path_to_file ) as theFile:
        for line in theFile:
            if line.strip():
                chunk.append( line )
                continue
            else:
                yield "".join( *chunk )
                chunk = [ ]

Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.
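
For example, here is one way you might collapse each entry to a single line before yielding it (just a sketch, untested against your actual log):

def logreader(path_to_file):
    chunk = []
    with open(path_to_file) as the_file:
        for line in the_file:
            if line.strip():                 # part of the current entry
                chunk.append(line.strip())
            elif chunk:                      # blank line ends the entry
                yield " ".join(chunk)        # newlines collapsed to spaces
                chunk = []
    if chunk:                                # file may not end with a blank line
        yield " ".join(chunk)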

katrielalex
Thanks, this is a lot cleaner than the original; however, it still chokes, expecting [**] as the second line rather than as the first line of the next chunk.
serverninja
I'm still not sure I understand. Do you mean that `pyparsing` isn't understanding the chunks? I believe it treats newlines as whitespace and ignores them.
katrielalex
+5  A: 
import pyparsing as pyp
import itertools

integer=pyp.Word(pyp.nums)
ip_addr=pyp.Combine(integer+'.'+integer+'.'+integer+'.'+integer)

def snort_parse(logfile):
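    # note: SkipTo("[**]", include=True) in the header consumes the free-text alert
    # message up to and including the closing "[**]"; Suppress() drops it from the results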
    header = (pyp.Suppress("[**] [")
              + pyp.Combine(integer + ":" + integer + ":" + integer)
              + pyp.Suppress(pyp.SkipTo("[**]",include=True)))
    cls = (
        pyp.Suppress(pyp.Optional(pyp.Literal("[Classification:")))
        + pyp.Regex("[^]]*") + pyp.Suppress(']'))

    pri = pyp.Suppress("[Priority:") + integer + pyp.Suppress("]")
    date = pyp.Combine(
        integer+"/"+integer+'-'+integer+':'+integer+':'+integer+'.'+integer)
    src_ip = ip_addr + pyp.Suppress("->")
    dest_ip = ip_addr

    bnf=header+cls+pri+date+src_ip+dest_ip

    with open(logfile) as snort_logfile:
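        # groupby() splits the file into alternating runs of blank and non-blank
        # lines; each run of non-blank lines is one multi-line Snort entry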
        for has_content,grp in itertools.groupby(
                snort_logfile,key=lambda x: bool(x.strip())):
            if has_content:
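                # rejoin the entry's lines and scan the whole chunk for the grammar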
                tmpStr=''.join(grp)
                fields = bnf.searchString(tmpStr)
                print(fields)

snort_parse('snort_file')

yields

[['1:486:4', 'Misc activity', '3', '08/03-07:30:02.233350', '172.143.241.86', '63.44.2.33']]
unutbu
You are a god. This is a solution beyond my expertise but will soon be implemented, as soon as I make myself understand all the working parts. Thank you!
serverninja
+1 - Nice answer ~unutbu, beat me to the punch! (Your groupby code looks pretty crazy, I'll have to sort through it when I get a few minutes.)
Paul McGuire
+many just for the lovely and elegant use of `groupby`.
katrielalex
Thanks all for the kind words :)
unutbu
+4  A: 

You have some regex unlearning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of this construct:

some_stuff + Regex(".*") + 
                 Suppress(string_representing_where_you_want_the_regex_to_stop)

Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is going to just read until the end of the line, since that is where ".*" stops without specifying multiline.

In pyparsing, this concept is implemented using SkipTo. Here is how your header line is written:

header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + 
             Suppress("]") + Regex(".*") + Suppress("[**]") 

Your ".*" problem gets resolved by changing it to:

header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + 
             Suppress("]") + SkipTo("[**]") + Suppress("[**]") 

Same thing for cls.

One last bug: your definition of date is short by one ':' + integer:

date = integer + "/" + integer + "-" + integer + ":" + integer + "." + 
          Suppress(integer) 

should be:

date = integer + "/" + integer + "-" + integer + ":" + integer + ":" + 
          integer + "." + Suppress(integer) 

I think those changes will be sufficient to start parsing your log data.

Here are some other style suggestions:

You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressable punctuation in a very compact and easy to maintain statement like this:

LBRACK,RBRACK,LBRACE,RBRACE = map(Suppress,"[]{}")

(expand to add whatever other punctuation characters you like). Now I can use these characters by their symbolic names, and I find the resulting code a little easier to read.
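
For instance (a small self-contained sketch, reusing the Priority field from your log, with a couple of extra punctuation names just to show the expansion):

from pyparsing import Suppress, Word, nums

LBRACK, RBRACK, COLON, DASH = map(Suppress, "[]:-")
integer = Word(nums)

# the punctuation names read almost like the text they match
pri = LBRACK + Suppress("Priority:") + integer("priority") + RBRACK
fields = pri.parseString("[Priority: 3]")
print(fields.priority)      # -> 3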

You start off header with header = Suppress("[**] [") + .... I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" was changed to use 2 or 3 spaces, or a tab, then your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with

header = Suppress("[**]") + LBRACK + ...

I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.

Once you have your fields parsed out, start assigning results names to different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, change cls to:

cls = Optional(Suppress("[Classification:") + 
             SkipTo(RBRACK)("classification") + RBRACK) 

This will allow you to access the classification data as fields.classification.
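
For instance (a minimal standalone check, with a made-up classification string):

from pyparsing import Suppress, SkipTo, Optional

RBRACK = Suppress("]")
cls = Optional(Suppress("[Classification:") + SkipTo(RBRACK)("classification") + RBRACK)

fields = cls.parseString("[Classification: Misc activity]")
print(fields.classification)    # -> Misc activity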

Paul McGuire
Yes, I admit I definitely reached for the regex hammer on this one (you should see it, it's pretty unwieldy) -- but I dove into pyparsing last night and this is what I'd arrived at. Definitely some paradigm shifting, but with the vast quantity of data and, more so, the variance of the data and fields, pyparsing was the only other option. Thank you for your insight!
serverninja
And from the author of pyparsing, no less! Thanks again, Paul!
serverninja
Comment with a question: is it no longer necessary to use the setResultsName() method for naming fields? It looks like an implied shortcut above, but I can't find this in the documentation. Thanks!
serverninja
This was added back in version 1.4.7, and there are mentions of it in the CHANGES file and in HowToUsePyparsing.html. But I don't see any mention of it in the generated epydocs; I'll have to fix that. This shortcut works only if listAllMatches is left at its default value of False - if you want to set it to True, you'll need to use the full method call form.
Paul McGuire