Suppose you have this string (one line):

10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"

and you want to extract the part between the GET and HTTP (i.e., some url) but only if it contains the word 'puzzle'. How would you do that using regular expressions in Python?

Here's my solution so far.

match = re.search(r'GET (.*puzzle.*) HTTP', my_string)

It works, but I have a feeling that I should change the first, the second, or both .* to .*? to make them non-greedy. Does it actually matter in this case?

+5  A: 

No need for a regex:

>>> s
'10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"'

>>> s.split("HTTP")[0]
'10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ '

>>> if "puzzle" in s.split("HTTP")[0].split("GET")[-1]:
...     print("found puzzle")
...
ghostdog74
+2  A: 

Hi,

It does matter. The User-Agent can contain anything. Use non-greedy for both of them.
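
To make the point concrete, here is a minimal sketch. The two log lines are made up for illustration (they are not from the question), and the \S-based pattern is an extra suggestion that assumes the URL contains no spaces. It shows that the greedy pattern over-captures when the User-Agent contains " HTTP", and that even the non-greedy pattern can match a 'puzzle' that sits in the User-Agent rather than in the URL.

```python
import re

# Hypothetical log lines, invented for this illustration.
# line_a: 'puzzle' is in the URL; the User-Agent happens to contain ' HTTP'.
line_a = ('10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] '
          '"GET /puzzle/a.jpg HTTP/1.0" 302 528 "-" "Fetcher HTTP client"')
# line_b: 'puzzle' is NOT in the URL, only in the User-Agent.
line_b = ('10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] '
          '"GET /keyser/22300/ HTTP/1.0" 302 528 "-" "puzzle-bot HTTP client"')

greedy = re.compile(r'GET (.*puzzle.*) HTTP')    # pattern from the question
lazy = re.compile(r'GET (.*?puzzle.*?) HTTP')    # both quantifiers non-greedy
strict = re.compile(r'GET (\S*puzzle\S*) HTTP')  # assumption: no spaces in the URL

greedy_a = greedy.search(line_a)
lazy_a = lazy.search(line_a)
strict_a = strict.search(line_a)

lazy_b = lazy.search(line_b)
strict_b = strict.search(line_b)

print(greedy_a.group(1))   # runs on into the User-Agent
print(lazy_a.group(1))     # stops at the first ' HTTP'
print(lazy_b is not None)  # still matches, although the URL has no 'puzzle'
print(strict_b)            # no match, because \S cannot cross a space
```

So non-greedy fixes the over-capture, but if 'puzzle' may show up elsewhere in the line, restricting the match to non-space characters is probably the safer choice.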

Alin Purcaru
In its current form, it only matters if there is more than one GET ... HTTP pair in a single line, which I doubt there ever will be. Making it non-greedy would be the safer choice, though.
Lieven
+1  A: 
>>> s = '10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"'
>>> s.split()[6]
'/keyser/22300/'
SilentGhost
Log messages sometimes have non-blank content between the two dashes, which would throw off the indexing in your split.
Paul McGuire
Nothing a trivial if statement wouldn't fix.
SilentGhost
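
One way the split-based approach could be made independent of a fixed index is sketched below. It is only an illustration: the helper name extract_get_url is hypothetical, and it assumes the request is logged as a quoted "GET <url> HTTP/x.y" field, as in the question's line.

```python
def extract_get_url(line, word="puzzle"):
    """Return the GET request URL if it contains `word`, else None.

    Hypothetical helper: it looks for the '"GET' token instead of relying
    on a fixed position such as split()[6].
    """
    tokens = line.split()
    # Stop one token early so tokens[i + 1] always exists.
    for i, tok in enumerate(tokens[:-1]):
        if tok == '"GET':
            url = tokens[i + 1]
            return url if word in url else None
    return None


# The line from the question: the URL has no 'puzzle', so the result is None.
s = ('10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] '
     '"GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0"')
# Made-up line with extra content between the dashes and a puzzle URL.
t = ('10.254.254.28 - some user [06/Aug/2007:00:12:20 -0700] '
     '"GET /puzzle/a.jpg HTTP/1.0" 302 528 "-" "Mozilla/5.0"')

print(extract_get_url(s))
print(extract_get_url(t))
```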