I am looking for a regex pattern matcher for a String in HttpLogFormat. The log is generated by haproxy. Below is a sample String in this format.

Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"

An explanation of the format is available at HttpLogFormat. Any help is appreciated.

I am trying to get the individual pieces of information included in that line. Here are the fields:

  1. process_name '[' pid ']:'
  2. client_ip ':' client_port
  3. '[' accept_date ']'
  4. frontend_name
  5. backend_name '/' server_name
  6. Tq '/' Tw '/' Tc '/' Tr '/' Tt*
  7. status_code
  8. bytes_read
  9. captured_request_cookie
  10. captured_response_cookie
  11. termination_state
  12. actconn '/' feconn '/' beconn '/' srv_conn '/' retries
  13. srv_queue '/' backend_queue
  14. '{' captured_request_headers* '}'
  15. '{' captured_response_headers* '}'
  16. '"' http_request '"'
+1  A: 

That looks like a very complicated string to match on. I would recommend using a tool like Expresso. Start with the string you are trying to match then start replacing pieces of it with Regex notation.

To grab individual pieces, use grouping parentheses.

The other option would be to make a regex for each piece you are trying to grab.
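As a rough illustration of the "one regex per piece, with grouping parentheses" idea, here is a minimal Python sketch. The group names and the `PIECES` list are made up for this example, it only covers the first eight fields, and it assumes the status code and byte count are plain digits as in the sample line.

```python
import re

SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')

# One small pattern per field, each with a named group (names are hypothetical).
PIECES = [
    r'(?P<process>\w+)\[(?P<pid>\d+)\]:',
    r'(?P<client_ip>[\d.]+):(?P<client_port>\d+)',
    r'\[(?P<accept_date>[^\]]+)\]',
    r'(?P<frontend>\S+)',
    r'(?P<backend>[^/\s]+)/(?P<server>\S+)',
    r'(?P<timers>\S+)',
    r'(?P<status>\d+)',      # assumption: digits only, as in the sample
    r'(?P<bytes_read>\d+)',  # assumption: digits only, as in the sample
]

# The pieces are joined with \s+ so each one can be tested on its own first.
PATTERN = re.compile(r'\s+'.join(PIECES))

match = PATTERN.search(SAMPLE)
fields = match.groupdict() if match else {}
print(fields)
```

Building the pattern from a list keeps each piece short enough to debug separately, and further fields can be appended once the earlier ones match.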

Seattle Leonard
+1  A: 

Use at your own peril.

This assumes that every field is non-empty except for the ones you have marked with asterisks (I'm assuming that is what the asterisk means). There are also obvious failure cases, such as nested brackets of any kind, but if the logger prints reasonably sane messages, then I guess you'd be okay...

Of course, even I personally wouldn't want to have to maintain this, but there you have it. You might want to consider writing a regular ol' parser for this instead, if you can.

Edit: Marked this as CW since it's more of an "I wonder how this will turn out" kind of answer than anything else. For quick reference, this is what I ended up constructing in rubular:

^[^[]+\s+(\w+)\[(\d+)\]:([^:]+):(\d+)\s+\[([^\]]+)\]\s+[^\s]+\s+(\w+)\/(\w+)\s+(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d*)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)\s+(\d+)\/(\d+)\s+\{([^}]*)\}\s\{([^}]*)\}\s+\"([^"]+)\"$

My first programming language was Perl, and even I'm willing to admit that I'm frightened by that.

eldarerathis
+1 just for putting that nasty thing out! I'll try it out and update how it goes.
Thimmayya
A: 

I don't think regex is your best option here...however, if it's your ONLY option...

Try looking at these options instead. http://serverfault.com/q/62687/438

Keng
what other options do you suggest?
Thimmayya
@Thimmayya I think Splunk would be at the top of my list. http://www.splunk.com/
Keng
A: 

Why are you trying to match the line precisely? If you're looking for specific fields in it, it is better to specify which ones and extract just those. If you want to run statistics on haproxy logs, you should take a look at the "halog" tool in the "contrib" directory in the sources. Take the one from version 1.4.9; it even knows how to sort URLs by response time.

But whatever you want to do with those lines, regex will probably always be the slowest and most complex solution.
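For comparison, extracting a handful of fields without any regex could look roughly like the Python sketch below. The function name `parse_line` and the returned keys are invented for illustration. It assumes the syslog prefix and field order of the sample line in the question, and that the quoted request itself contains no double quotes.

```python
SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')


def parse_line(line):
    """Pull a few fields out of one haproxy HTTP log line by splitting."""
    # The first 17 whitespace-separated tokens are fixed-position fields;
    # the remainder holds the captured headers and the quoted request.
    parts = line.split(None, 17)
    if len(parts) < 18:
        raise ValueError('unexpected line format')

    client_ip, _, client_port = parts[5].rpartition(':')
    backend, _, server = parts[8].partition('/')
    # int() accepts a leading '+' or '-', so signed timer values still parse.
    timers = [int(t) for t in parts[9].split('/')]
    # The request is the text between the last two double quotes.
    request = parts[17].rstrip().rsplit('"', 2)[1]

    return {
        'client_ip': client_ip,
        'client_port': int(client_port),
        'frontend': parts[7],
        'backend': backend,
        'server': server,
        'timers': timers,
        'status': int(parts[10]),
        'request': request,
    }


record = parse_line(SAMPLE)
print(record)
```

Only the fields that are actually needed get converted, which keeps the per-line work small.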

Willy Tarreau
+1  A: 

Regex:

^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) (\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) \{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$

Results:

Group 1:    Feb 6 12:14:14
Group 2:    localhost
Group 3:    haproxy
Group 4:    14389
Group 5:    10.0.1.2
Group 6:    33317
Group 7:    06/Feb/2009:12:14:14.655
Group 8:    http-in
Group 9:    static
Group 10:   srv1
Group 11:   10/0/30/69/109
Group 12:   200
Group 13:   2750
Group 14:   -
Group 15:   -
Group 16:   ----
Group 17:   1/1/1/1/0
Group 18:   0/0
Group 19:   1wt.eu
Group 20:   
Group 21:   GET
Group 22:   /index.html
Group 23:   HTTP/1.1

I use RegexBuddy for composing complex regular expressions.
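A minimal Python sketch of applying the regex above to the sample line from the question follows. The names in `GROUP_NAMES` are labels chosen for this example only; the pattern itself is copied unchanged and merely split across three string literals for readability.

```python
import re

SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')

# Same pattern as above; adjacent raw strings are concatenated by Python.
HAPROXY_RE = re.compile(
    r'^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) '
    r'(\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) '
    r'\{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$'
)

# Hypothetical labels for groups 1..23, in order.
GROUP_NAMES = [
    'syslog_date', 'host', 'process', 'pid', 'client_ip', 'client_port',
    'accept_date', 'frontend', 'backend', 'server', 'timers', 'status',
    'bytes_read', 'request_cookie', 'response_cookie', 'termination_state',
    'connections', 'queues', 'request_headers', 'response_headers',
    'method', 'uri', 'http_version',
]

match = HAPROXY_RE.match(SAMPLE)
record = dict(zip(GROUP_NAMES, match.groups())) if match else {}
for name, value in record.items():
    print(f'{name}: {value}')
```

Note that the pattern expects exactly one space between most fields, so lines with different spacing will not match.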

Mike Clark