tags:
views: 121
answers: 6

I have a web site where there are links like <a href="http://www.example.com?read.php=123">. Can anybody show me how to get all the numbers (123, in this case) in such links using Python? I don't know how to construct a regex. Thanks in advance.

+3  A: 
import re
# capture the digits that follow "read.php=" in each link
re.findall(r"\?read\.php=(\d+)", data)
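A quick usage sketch, assuming `data` holds the sample link from the question:

```python
import re

data = '<a href="http://www.example.com?read.php=123">'
print(re.findall(r"\?read\.php=(\d+)", data))  # -> ['123']
```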
S.Mark
A: 

/[0-9]/

that's the regex syntax you want

for reference see

http://gnosis.cx/publish/programming/regular_expressions.html

Ahmad Dwaik
Not really helpful, as this addresses only the generic case; also, /[0-9]/ only matches a single digit (between slashes), so the answer is also incorrect. The correct syntax is in S.Mark's answer.
Kimvais
+1  A: 

One without the need for regex

>>> s='<a href="http://www.example.com?read.php=123">'
>>> for item in s.split(">"):
...     if "href" in item:
...         print item[item.index("a href")+len("a href="): ]
...
"http://www.example.com?read.php=123"

if you want to extract the numbers

item[item.index("a href")+len("a href="): ].split("=")[-1].rstrip('"')  # rstrip drops the trailing quote
Does not really answer the question; Baha wanted to extract the numbers, not the links.
Kimvais
I believe I do not have an obligation to provide a FULL solution. If SO had that policy, it would be the same as doing people's homework (if it's disguised as one) or something.
True, you do not have an obligation to answer the question - but the SO policy is to comment on downvotes, and that's why I pointed out that your answer does not really solve the problem, just part of it.
Kimvais
A lot of the answers on SO do not fully solve problems either. Are you going to downvote every one of them? If my answer were totally off another planet, then I'd be fine with a downvote. But a downvote is uncalled for if I'm going in the right direction.
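Putting the split-based approach together as one runnable sketch (Python 3 print syntax; assumes the same sample string from the question):

```python
s = '<a href="http://www.example.com?read.php=123">'
for item in s.split(">"):
    if "href" in item:
        # slice off everything up to and including 'a href='
        link = item[item.index("a href") + len("a href="):]
        # the number is after the last '='; rstrip drops the trailing quote
        number = link.split("=")[-1].rstrip('"')
        print(number)  # -> 123
```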
+2  A: 

While the other answers are sort of correct, you should probably use the urllib2 library instead:

from urllib2 import urlparse
import re

urlre = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>', re.IGNORECASE)
links = urlre.findall('<a href="http://www.example.com?read.php=123">')
for link in links:
    url = urlparse.urlparse(link)
    # the query string is field 4; parameters are separated by '&'
    d = dict(x.split("=") for x in url[4].split('&'))
    print d["read.php"]

It's not as simple as some of the above, but it keeps working even with more complex URLs.

Kimvais
You don't need a regex to find the whole string; just using the "in" operator will do. In fact, regex is not necessary.
You don't need a regexp to *find* a string, but to *GET* a part of a string you have to use something that can express what to take. Also, if you look at the HTML syntax, 'href' is not the only possible attribute of the 'a' tag, and it does not have to be the last, or the first. The regexp will match all valid 'a' tags.
Kimvais
you should also compile your re with IGNORECASE
good point. added
Kimvais
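The query-string parsing above can also be delegated to the standard library's parse_qs; this sketch uses Python 3's urllib.parse (in Python 2 the same functions live in the urlparse module):

```python
from urllib.parse import urlparse, parse_qs

link = "http://www.example.com?read.php=123"
params = parse_qs(urlparse(link).query)  # {'read.php': ['123']}
print(params["read.php"][0])  # -> 123
```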
+3  A: 

"If you have a problem, and decide to use regex, now you have two problems..."

If you are reading one particular web page and you know how it is formatted, then regex is fine - you can use S.Mark's answer. To parse a particular link, you can use Kimvais' answer. However, to get all the links from a page, you're better off using something more serious; any regex solution you come up with will have flaws.

I recommend mechanize. If you notice, the Browser class there has a links method which gets you all the links in a page. It has the added benefit of being able to download the page for you =) .

Claudiu
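If pulling in mechanize is too heavy, the standard library's HTMLParser can collect hrefs the same way; this is an illustrative alternative (Python 3's html.parser module), not what Claudiu recommended:

```python
from html.parser import HTMLParser  # the HTMLParser module in Python 2


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


parser = LinkCollector()
parser.feed('<a href="http://www.example.com?read.php=123">link</a>')
print(parser.links)  # -> ['http://www.example.com?read.php=123']
```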
+2  A: 

This will work irrespective of how your links are formatted (e.g. if some look like <a href="foo=123"/> and some look like <A TARGET="_blank" HREF='foo=123'/>).

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
# match hrefs that end in '=' followed by digits
p = re.compile(r'^.*=(\d*)$')
for a in soup.findAll('a'):
    m = p.match(a["href"])
    if m:
        print m.group(1)
Robert Rossney