I have a web site where there are links like <a href="http://www.example.com?read.php=123">
Can anybody show me how to get all the numbers (123, in this case) from such links using Python? I don't know how to construct a regex. Thanks in advance.
/[0-9]/
That's the regex syntax you want.
For reference, see
http://gnosis.cx/publish/programming/regular%5Fexpressions.html
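To make that concrete, here is a minimal sketch in Python. It assumes every link of interest contains read.php= followed by digits; the names ID_PATTERN and extract_ids are made up for illustration. Note that Python patterns are written without the surrounding slashes, and that [0-9]+ (rather than [0-9]) is needed to match a whole run of digits.

```python
import re

# Hypothetical pattern: capture the digits that follow "read.php=".
ID_PATTERN = re.compile(r'read\.php=([0-9]+)')

def extract_ids(html):
    """Return every number that follows 'read.php=' as a list of strings."""
    return ID_PATTERN.findall(html)

html = ('<a href="http://www.example.com?read.php=123"> '
        '<a href="http://www.example.com?read.php=45">')
print(extract_ids(html))  # ['123', '45']
```

This is fine for a page whose format you control, but it looks at raw text only and knows nothing about tags or attributes.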
One without the need for regex:
>>> s='<a href="http://www.example.com?read.php=123">'
>>> for item in s.split(">"):
...     if "href" in item:
...         print(item[item.index("a href")+len("a href="): ])
...
"http://www.example.com?read.php=123"
If you want to extract just the number:
item[item.index("a href")+len("a href="): ].split("=")[-1]
While the other answers are sort of correct, you should probably parse the URL with a library (urlparse, reached here through urllib2) instead:
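from urllib2 import urlparse  # Python 2; a plain "import urlparse" gives the same module
import re

# Capture the href value of every <a> tag, ignoring case.
urlre = re.compile('<a[^>]+href="([^"]+)"[^>]*>', re.IGNORECASE)
links = urlre.findall('<a href="http://www.example.com?read.php=123">')
for link in links:
    url = urlparse.urlparse(link)
    # url[4] is the query string; split it into [key, value] pairs.
    s = [x.split("=", 1) for x in url[4].split('&')]
    d = {}
    for k, v in s:
        d[k] = v
    print(d["read.php"])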
It's not as simple as some of the above, but it copes better with more complex URLs.
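For readers on Python 3, where the urlparse module became urllib.parse, the same idea might look like the sketch below. The names HREF_RE, read_ids and the key parameter are invented for this example, and parse_qs is used in place of splitting by hand so that URLs with several parameters are handled.

```python
import re
from urllib.parse import urlparse, parse_qs

# The regex only locates href values; the URL itself is parsed by urllib.
HREF_RE = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>', re.IGNORECASE)

def read_ids(html, key="read.php"):
    """Return the values of `key` found in the query string of each link."""
    ids = []
    for link in HREF_RE.findall(html):
        query = urlparse(link).query
        # parse_qs maps each parameter name to a list of its values.
        ids.extend(parse_qs(query).get(key, []))
    return ids

page = '<a href="http://www.example.com?read.php=123&lang=en">'
print(read_ids(page))  # ['123']
```

Links that lack the parameter simply contribute nothing, so no KeyError handling is needed.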
"If you have a problem, and decide to use regex, now you have two problems..."
If you are reading one particular web page and you know how it is formatted, then regex is fine - you can use S. Mark's answer. To parse a particular link, you can use Kimvai's answer. However, to get all the links from a page, you're better off using something more serious. Any regex solution you come up with will have flaws.
I recommend mechanize. If you notice, the Browser class there has a links method which gets you all the links in a page. It has the added benefit of being able to download the page for you =) .
This will work irrespective of how your links are formatted (e.g. if some look like <a href="foo=123"/> and some look like <A TARGET="_blank" HREF='foo=123'/>).
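mechanize is a third-party package, so as a rough illustration of why a real parser copes with both link styles, here is a sketch that swaps in the standard library's html.parser instead. It is not mechanize's API, and LinkCollector is a hypothetical name chosen for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whatever its case or quoting."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="foo=123"/> <A TARGET="_blank" HREF=\'foo=123\'/>')

# Take whatever follows the last "=" in each link.
numbers = [link.rsplit("=", 1)[-1] for link in collector.links]
print(numbers)  # ['123', '123']
```

Both the lowercase, double-quoted link and the uppercase, single-quoted one come out the same way, which is the point made above.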
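import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)  # html holds the page source as a string
p = re.compile(r'^.*=(\d+)$')  # capture the digits after the last "="
for a in soup.findAll('a'):
    # Anchors without an href are skipped instead of raising KeyError.
    m = p.match(a.get("href", ""))
    if m:
        print(m.group(1))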