I have the following code:

what = re.match("get|post|put|head\s+(\S+) ",data,re.IGNORECASE)

and let's say the data variable contains this line:

GET some-site.com HTTP/1.0 ...

If I stop the script in the debugger and inspect the what variable, I can see it only matched GET. Why doesn't it match some-site.com?

+3  A: 

>>> re.match(r"(get|post|put|head)\s+(\S+) ", 'GET some-site.com HTTP/1.0 ...', re.IGNORECASE).groups()
('GET', 'some-site.com')

Mykola Kharechko
It works, but can you please explain why my version doesn't work? I only want to capture the second word. I know I can access it by calling .group(1), but I'm puzzled as to why my version didn't work.
Geo
"Why is 1+2+3+4*100 equal to 406 and not 1000?" http://www.amk.ca/python/howto/regex/regex.html#SECTION000510000000000000000 . Read about the "|" character and its precedence.
ΤΖΩΤΖΙΟΥ
+3  A: 

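Regex operator precedence makes head\s+(\S+) the fourth alternative: | binds more loosely than concatenation, so the pattern reads as get, or post, or put, or head\s+(\S+) followed by a space. The parentheses in @Mykola Kharechko's answer make head the fourth alternative, and \s+(\S+) is then appended to whichever alternative in the group matched.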
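
A minimal sketch of the difference, reusing the sample line from the question. The non-capturing group (?:...) is not part of either answer; it is shown here on the assumption that only the second word should be captured, as the asker mentions in the comments. The variable names are illustrative.

```python
import re

line = "GET some-site.com HTTP/1.0 ..."

# Without parentheses, "|" splits the whole pattern into four alternatives:
#   get   or   post   or   put   or   head\s+(\S+)<space>
# so "get" matches on its own and the capture group never takes part.
unparenthesized = re.match(r"get|post|put|head\s+(\S+) ", line, re.IGNORECASE)
print(unparenthesized.group(0))  # GET
print(unparenthesized.group(1))  # None

# A non-capturing group limits the alternation to the method names,
# leaving the second word as the only captured group.
grouped = re.match(r"(?:get|post|put|head)\s+(\S+) ", line, re.IGNORECASE)
print(grouped.group(1))  # some-site.com
```

With the non-capturing form, .group(1) is the second word, so the method name does not shift the group numbering.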

gimel
+1  A: 

+1 to Mykola's answer and gimel's explanation. In addition, do you really want to use a regex for this? As you've found out, they are not as straightforward as they look. Here's a non-regex-based method:

def splitandpad(s, find, limit):
    # Split on `find` at most `limit` times, then pad with '' to limit+1 items.
    seq = s.split(find, limit)
    return seq + [''] * (limit - len(seq) + 1)

method, path, protocol = splitandpad(data, ' ', 2)
if method.lower() not in ('get', 'head', 'post', 'put'):
    raise ValueError('unknown method: %r' % method)
if protocol.lower() not in ('http/1.0', 'http/1.1'):
    raise ValueError('unknown protocol: %r' % protocol)
bobince
Yeah, I know I could have used split. I started with a regex, and then got angry that it didn't work :)
Geo