I'm writing a Python script to extract data from our 2GB Apache access log. Here's one line from the log.

81.52.143.15 - - [01/Apr/2008:00:07:20 -0600] "GET /robots.txt HTTP/1.1" 200 29 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 (http://www.voila.com/)"

I'm trying to get the date portion from that line, and regex is failing me, and I'm not sure why. Here's my Python code:

l = '81.52.143.15 - - [01/Apr/2008:00:07:20 -0600] "GET /robots.txt HTTP/1.1" 200 29 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 (http://www.voila.com/)"'

re.match(r"\d{2}/\w{3}/\d{4}", l)

returns nothing. Neither do the following:

re.match(r"\d{2}/", l)
re.match(r"\w{3}", l)

or anything else I can think of to even get part of the date. What am I misunderstanding?

A: 

match() only looks for a match at the beginning of the string, and your line starts with an IP address rather than the date. Try search() instead.

See also the Python Regular Expression HOWTO, and the Python page at the always-excellent regular-expressions.info.

Michael Myers
+5  A: 

match() looks for a match at the beginning of the string. Use search() to look for a match anywhere in the string. More info here: http://docs.python.org/library/re.html#matching-vs-searching
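To make the difference concrete, here is a minimal sketch using the log line from the question. The variable names (line, date_pattern, anchored, found) are only for illustration.

```python
import re

line = ('81.52.143.15 - - [01/Apr/2008:00:07:20 -0600] '
        '"GET /robots.txt HTTP/1.1" 200 29 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) '
        'VoilaBot BETA 1.2 (http://www.voila.com/)"')

date_pattern = r"\d{2}/\w{3}/\d{4}"

# match() only succeeds if the pattern matches starting at position 0.
# The line starts with an IP address, so this returns None.
anchored = re.match(date_pattern, line)

# search() scans forward and returns the first match anywhere in the line.
found = re.search(date_pattern, line)
date = found.group(0) if found else None

print(anchored)
print(date)
```

Both functions return None when nothing matches, so check the result before calling group() on it.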

Swingley
How did I miss that? Thanks for the quick reply.
saturdayplace
A: 

Rather than using regular expressions to get the date, it might be easier to just split the line on spaces and extract the date:

 l = '81.52.143.15 - - [01/Apr/2008:00:07:20 -0600] "GET /robots.txt HTTP/1.1" 200 29 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 (http://www.voila.com/)"'
 date = l.split()[3]

If you're processing very large files, this is probably more efficient than using regular expressions.

Jason Abate
I thought about that too, but I also want to grab the user-agent string, and splitting on spaces wrecks that. Also, personally, `re.search(r'\d{2}/\w{3}/\d{4}', l)` seems more semantic (find two digits/three word characters/four digits) than `l.split()[3]` (find the fourth chunk in the string), which helps future readability.
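For what it's worth, one pattern can probably pull out both fields. This is a rough sketch, assuming the line layout shown in the question, where the user agent is the last double-quoted field on the line. LOG_PATTERN and the group names are made up for illustration and are not from any library.

```python
import re

# Hypothetical pattern: the date follows the opening bracket, and the
# user agent is assumed to be the last double-quoted field on the line.
LOG_PATTERN = re.compile(
    r'\[(?P<date>\d{2}/\w{3}/\d{4})'   # e.g. 01/Apr/2008
    r'.*'                              # skip time, request, status, referer
    r'"(?P<agent>[^"]*)"\s*$'          # final quoted field
)

line = ('81.52.143.15 - - [01/Apr/2008:00:07:20 -0600] '
        '"GET /robots.txt HTTP/1.1" 200 29 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) '
        'VoilaBot BETA 1.2 (http://www.voila.com/)"')

m = LOG_PATTERN.search(line)
if m:
    date = m.group("date")
    agent = m.group("agent")
    print(date)
    print(agent)
```

A user agent that itself contains an escaped double quote would break the assumption above, so treat this as a starting point only.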
saturdayplace
A: 

Or you can use one of the already available Python Apache log parsers, such as:

  • Apachelogs
  • Logtools
  • Logrep (Wtop package)
miniwark