Hey guys, I'm trying to understand regular expressions while scraping a site. I've used them enough in my code to pull out the following, but I'm stuck here. I need to quickly grab this:

http://www.example.com/online/store/TitleDetail?detail&sku=123456789

from this:

('<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&amp;sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t            \tcheck store inventory\r\n\t\t\t            </a>', 1)

This is where I got confused. Any ideas?

Edit: the SKU number changes per product, and that is where the trouble lies for me.

+1  A: 
http://www\.example\.com/online/store/TitleDetail\?detail&amp;sku=\d+

Use the \d character class with the greedy + quantifier to match one or more digits in the sku field.
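For what it's worth, here is a minimal sketch of how that pattern might be applied in Python. The names `scraped` and `URL_PATTERN` are made up for illustration, the whitespace inside the tag is shortened, and the `unescape` step is an assumption about what you want, since the target URL in the question has a plain & instead of &amp;.

```python
import re
from html import unescape

# The scraped value is a tuple; its first element is the anchor tag's HTML.
scraped = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location='
    "'http://www.example.com/online/store/TitleDetail?detail&amp;sku=123456789';}"
    '" id="getTitleDetails_123456789">\r\n\t\t\tcheck store inventory\r\n\t\t\t</a>',
    1,
)

# Literal dots and the question mark are escaped; \d+ matches one or more digits.
URL_PATTERN = re.compile(
    r"http://www\.example\.com/online/store/TitleDetail\?detail&amp;sku=\d+"
)

match = URL_PATTERN.search(scraped[0])
# unescape() turns the HTML entity &amp; back into a plain &.
url = unescape(match.group(0)) if match else None
print(url)
```

Checking `match` before calling `group` avoids an AttributeError on pages where the link is missing.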

arthurprs
This def worked. Thanks!
Diego
A: 

You don't need regular expressions for that, just use string methods:

result = html[0].split("window.location='")[1].split("'")[0]
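A slightly more defensive sketch of the same string-method idea, in case the marker is sometimes absent. The function name `extract_url` and the shortened `sample` tag are hypothetical, and decoding &amp; with `unescape` is an assumption based on the URL the question asks for.

```python
from html import unescape


def extract_url(tag_html, marker="window.location='"):
    """Return the URL that follows marker, or None if the marker is absent."""
    _, found, rest = tag_html.partition(marker)
    if not found:
        return None
    # Everything up to the closing single quote is the URL.
    raw_url = rest.split("'", 1)[0]
    # Decode HTML entities such as &amp; into plain characters.
    return unescape(raw_url)


# Shortened stand-in for the first element of the scraped tuple.
sample = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location='
    "'http://www.example.com/online/store/TitleDetail?detail&amp;sku=123456789';}"
    '">check store inventory</a>'
)

print(extract_url(sample))
print(extract_url('<a href="#">no redirect here</a>'))
```

Using `partition` instead of indexing into `split` means a tag without the marker gives None rather than an IndexError.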
David Morrissey
A: 
import re

# Matches a literal backslash before each quote, as in the raw-string haystack below.
pattern = re.compile(r"window\.location=\\'([^\\]*)")
haystack = r"""<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&amp;sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t\tcheck store inventory\r\n\t\t\t</a>"""
url = re.search(pattern, haystack).group(1)
Matthew Flaschen
A: 

If there are always 9 digits:

http://www\.example\.com/online/store/TitleDetail\?detail&amp;sku=[0-9]{9}

If there is an arbitrary number of digits:

http://www\.example\.com/online/store/TitleDetail\?detail&amp;sku=[0-9]*

More general:

http.*?sku=[0-9]*

(The ? in .*? makes the quantifier lazy: it tries shorter matches first, so it is less likely to find a match that spans multiple URLs.)
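To see the difference between a greedy and a lazy quantifier, here is a small sketch in Python. The `text` string with two URLs is made up for illustration, and the patterns use [0-9]+ so that at least one digit is required.

```python
import re

# Made-up text containing two URLs on one line.
text = 'a="http://www.example.com/x?sku=111" b="http://www.example.com/y?sku=222"'

# Greedy .* grabs as much as possible, so the match runs through both URLs.
greedy = re.search(r"http.*sku=[0-9]+", text).group(0)

# Lazy .*? stops at the first place the rest of the pattern can match.
lazy = re.search(r"http.*?sku=[0-9]+", text).group(0)

# With the lazy form, findall returns each URL separately.
all_lazy = re.findall(r"http.*?sku=[0-9]+", text)

print(greedy)
print(lazy)
print(all_lazy)
```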

Edit: [0-9], not [1-9].

themissinglint
A: 

http://txt2re.com/ might help you

Zach