This was originally a question I wanted to ask, but while researching the details for the question I found the solution and thought it may be of interest to others.
In Apache, the full request is in double quotes and any quotes inside are always escaped with a backslash:
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
I'm trying to construct a regex which matches all distinct fields. My current solution always stops on the first quote after the GET/POST (actually I only need all the values including the size transferred):
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"[^"]+"\s+(\d+)\s+(\d+|-)
I guess I'll also provide my solution from my PHP source with comments and better formatting:
$sPattern = ';^' .
# ip address: 1
'(\d+\.\d+\.\d+\.\d+)' .
# ident and user id
'\s+[^\s]+\s+[^\s]+\s+' .
# 2 day/3 month/4 year:5 hh:6 mm:7 ss +timezone
'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]' .
# whitespace
'\s+' .
# request uri
'"[^"]+"' .
# whitespace
'\s+' .
# 8 status code
'(\d+)' .
# whitespace
'\s+' .
# 9 bytes sent
'(\d+|-)' .
# end of regex
';';
Using this with a simple case where the URL doesn't contain other quotes works fine:
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
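To make the difference concrete, here is a minimal sketch that runs the pattern above against both sample lines with preg_match. The variable names $sSimple and $sEscaped are only illustrative; the pattern is the one from this post, assembled without the comments.

```php
<?php
// Pattern from the post; ';' is used as the regex delimiter.
$sPattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    . '\s+"[^"]+"\s+(\d+)\s+(\d+|-);';

// Sample line without quotes inside the request field.
$sSimple  = '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"';
// Sample line with escaped quotes inside the request field.
$sEscaped = '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"';

$iSimple  = preg_match($sPattern, $sSimple, $aSimple);
$iEscaped = preg_match($sPattern, $sEscaped, $aEscaped);

// The simple line matches; group 8 is the status, group 9 the bytes sent.
echo 'simple:  ', $iSimple, ' status=', $aSimple[8], ' bytes=', $aSimple[9], "\n";
// The escaped line does not match, because [^"]+ stops at the first \" .
echo 'escaped: ', $iEscaped, "\n";
```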
Now I'm trying to add support for zero, one or more occurrences of \" to it, but can't find a solution. Using regexpal.com I've come up with this so far:
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*"
Here's only the changed part:
# request uri
'"(.|\\\\(?="))*"' .
However, it's too greedy. It eats everything until the last ", when it should only eat until the first " not preceded by a \. I also tried introducing the requirement that there's no \ before the " I want, but it still eats to the end of the string (Note: I had to add extraneous \ characters to make this work in PHP):
# request uri
'"(.|\\\\(?="))*[^\\\\]"' .
But then it hit me: *? (from the documentation: "If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times)"):
# request uri
# (a literal backslash in the regex is written \\\\ inside a single-quoted PHP string)
'"(.|\\\\(?="))*?[^\\\\]"' .
The full regex:
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)
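As a quick check, the full regex can be run against the sample line with escaped quotes. This is only a sketch: the variable names are illustrative, and every literal backslash of the regex is written as \\\\ because the pattern lives in a single-quoted PHP string. Note that the parentheses around the request part form a capturing group of their own, so the status code and bytes sent move from groups 8 and 9 to groups 9 and 10.

```php
<?php
// Full regex from the post with the non-greedy request field.
$sPattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    // lazy body, then one non-backslash character, then the closing quote
    . '\s+"(.|\\\\(?="))*?[^\\\\]"'
    . '\s+(\d+)\s+(\d+|-);';

// Sample line from the top of the post (escaped quotes in the request).
$sLine = '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"';

$iResult = preg_match($sPattern, $sLine, $aMatch);

// Group 8 is the extra request group, so status and bytes are 9 and 10.
echo 'matched=', $iResult, ' status=', $aMatch[9], ' bytes=', $aMatch[10], "\n";
```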
Update 5th May 2009:
I discovered a small flaw in the regexp while parsing millions of lines: it breaks on lines which contain a backslash character right before the closing double quote. In other words, ...\\" will break the regex. Apache will not log ...\" but will always escape the backslash to \\, so when there are two backslash characters before the double quote, it's safe to assume that the quote ends the request field.
Does anyone have an idea how to fix this with the regex?
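One possible direction, offered as a sketch rather than a tested fix for real log data: a commonly used pattern for quoted strings with backslash escapes consumes the field as a sequence of "either a character that is neither a quote nor a backslash, or a backslash followed by any character". It relies on the assumption stated above that Apache escapes both " and \ with a backslash. The second sample line is hypothetical, made up to end the request in an escaped backslash, and the variable names are illustrative.

```php
<?php
$sPattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    // 8 request: unescaped characters or backslash-escaped pairs
    // (\\\\ in a single-quoted PHP string is one regex-escaped backslash)
    . '\s+"((?:[^"\\\\]|\\\\.)*)"'
    // 9 status code, 10 bytes sent
    . '\s+(\d+)\s+(\d+|-);';

$aLines = array(
    // escaped quotes inside the request (sample from the post)
    '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"',
    // hypothetical line: request ends in an escaped backslash, i.e. \\ before "
    '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\\\" 400 299 "-" "-" "-"',
);

foreach ($aLines as $sLine) {
    if (preg_match($sPattern, $sLine, $aMatch)) {
        echo 'request=', $aMatch[8], ' status=', $aMatch[9], ' bytes=', $aMatch[10], "\n";
    } else {
        echo "no match\n";
    }
}
```

Because the two alternatives can never match the same character, the group should not need the lazy quantifier or the lookahead. Group 8 now holds the raw request, still containing the backslash escapes.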
Helpful resources: the JavaScript Regexp documentation at developer.mozilla.org and regexpal.com