ansaurus

Question

Answer 1

+3 A:

import re
# `data` holds the page source as a string
re.findall(r"\?read\.php=(\d+)", data)
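As a quick check, here is that pattern run against a made-up snippet. The sample `data` string is an assumption (the question's actual page content is not shown), and the sketch is written for Python 3.

```python
import re

# Hypothetical page content; the real `data` from the question is not shown.
data = (
    '<a href="http://www.example.com?read.php=123">one</a> '
    '<a href="http://www.example.com?read.php=4567">two</a>'
)

# The raw string keeps the backslashes intact for the regex engine.
ids = re.findall(r"\?read\.php=(\d+)", data)
print(ids)  # ['123', '4567']
```

Note that findall returns the captured group only, so the result is a list of digit strings rather than whole links.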

S.Mark 2009-12-14 07:17:22

Answer 2

A:

/[0-9]/

That's the regex syntax you want.

For reference, see

http://gnosis.cx/publish/programming/regular%5Fexpressions.html

Ahmad Dwaik 2009-12-14 07:30:28

Not really helpful, as this addresses only the generic case. Also, /[0-9]/ matches only a single digit (the slashes are just delimiters), so the answer is also incorrect. The correct syntax is in S.Mark's answer.
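To make the difference concrete, here is a small Python 3 sketch comparing the two patterns on the example URL used elsewhere in this thread. The variable names are only illustrative.

```python
import re

url = "http://www.example.com?read.php=123"

# [0-9] on its own matches one digit at a time.
single_digits = re.findall(r"[0-9]", url)

# Anchoring on the parameter name and using \d+ captures the whole number.
full_number = re.findall(r"\?read\.php=(\d+)", url)

print(single_digits)  # ['1', '2', '3']
print(full_number)    # ['123']
```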

Kimvais 2009-12-14 08:50:45

Answer 3

+1 A:

One without the need for regex

>>> s='<a href="http://www.example.com?read.php=123">'
>>> for item in s.split(">"):
...     if "href" in item:
...         print item[item.index("a href")+len("a href="): ]
...
"http://www.example.com?read.php=123"

If you want to extract the numbers (stripping the closing quote that would otherwise be left on the end):

item[item.index("a href")+len("a href="): ].split("=")[-1].strip('"')
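The same regex-free idea can be wrapped in a small function. This is only a sketch in Python 3: `number_from_anchor` is a hypothetical helper name, and it assumes the href value is wrapped in double quotes and that the number follows the last "=".

```python
def number_from_anchor(tag):
    """Return the text after the last '=' inside the tag's href value."""
    marker = 'href="'
    start = tag.index(marker) + len(marker)
    # Slice up to the closing quote so it does not end up in the result.
    end = tag.index('"', start)
    href = tag[start:end]
    return href.split("=")[-1]

s = '<a href="http://www.example.com?read.php=123">'
print(number_from_anchor(s))  # 123
```

It raises ValueError when the tag has no double-quoted href, which may or may not be what you want for real pages.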

2009-12-14 08:29:59

Does not really answer the question: Baha wanted to extract the numbers, not the links.

Kimvais 2009-12-14 08:59:17

I believe I do not have an obligation to provide a FULL solution. If SO has this policy, then it's the same as doing people's homework (if it's disguised as one) or something.

2009-12-14 09:36:40

True, you have no obligation to answer the question, but the SO policy is to comment on down votes. That's why I pointed out that your answer does not really solve the problem, just part of it.

Kimvais 2009-12-14 17:33:11

A lot of the answers on SO do not fully solve problems either. Are you going to downvote every one of them? If my answer were totally from another planet, then I would be fine with a down vote. But a down vote is uncalled for if I am going in the right direction.

2009-12-15 00:21:59

Answer 4

+2 A:

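While the other answers are sort of correct, you should probably parse the URL with the urlparse module (Python 2 standard library) instead:

import re
import urlparse  # Python 2 module; also reachable as urllib2.urlparse

# Capture the href value of each <a> tag, wherever the attribute sits in the tag.
urlre = re.compile('<a[^>]+href="([^"]+)"[^>]*>',re.IGNORECASE)
links = urlre.findall('<a href="http://www.example.com?read.php=123">')
for link in links:
    url = urlparse.urlparse(link)
    # url[4] is the query string; parameters are normally separated by '&'.
    s = [x.split("=", 1) for x in url[4].split('&')]
    d = {}
    for k,v in s:
        d[k]=v
    print d["read.php"]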

It's not as simple as some of the above, but guaranteed to work even with more complex urls.
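For readers on Python 3, where the same functionality lives in urllib.parse, a minimal sketch of the query-parsing step might look like this. The extra `page=2` parameter is made up to show a URL with more than one parameter.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical link with a second, made-up parameter.
link = "http://www.example.com?read.php=123&page=2"

# Split the URL into components and take the query string.
query = urlparse(link).query

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(query)
print(params["read.php"][0])  # 123
```

Letting parse_qs do the splitting avoids hand-written handling of separators and of repeated parameters.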

Kimvais 2009-12-14 08:47:55

You don't need a regex to find the whole string; just using the "in" operator will do. In fact, regex is not necessary.

2009-12-14 09:33:03

You don't need regexp to 'find' a string. To *GET* a part of a string, you have to use something that can express what to take and what to find. Also, if you see the HTML syntax, 'href' is not the only possible attribute for the 'a' tag and it does not have to be the last, or the first. The regexp will match all valid 'a' tags.

Kimvais 2009-12-14 17:31:18

You should also compile your re with IGNORECASE.

2009-12-15 00:18:54

Good point. Added.

Kimvais 2009-12-15 09:28:39

Answer 5

+3 A:

"If you have a problem, and decide to use regex, now you have two problems..."

If you are reading one particular web page and you know how it is formatted, then regex is fine - you can use S. Mark's answer. To parse a particular link, you can use Kimvais's answer. However, to get all the links from a page, you're better off using something more serious. Any regex solution you come up with will have flaws.

I recommend mechanize. If you notice, the Browser class there has a links method which gets you all the links in a page. It has the added benefit of being able to download the page for you =) .
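If installing mechanize is not an option, the same "use a real parser" idea can be sketched with Python 3's standard-library html.parser. This is a swapped-in alternative, not mechanize's API; the `LinkCollector` class and the sample `page` string are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# Hypothetical page with differently formatted anchors.
page = (
    '<p><a href="http://www.example.com?read.php=123">x</a> '
    "<A TARGET=\"_blank\" HREF='http://www.example.com?read.php=456'>y</A></p>"
)

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Unlike mechanize, this does not download the page for you; it only parses HTML you already have.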

Claudiu 2009-12-14 08:52:25

Answer 6

+2 A:

This will work irrespective of how your links are formatted (e.g. if some look like <a href="foo=123"/> and some look like <A TARGET="_blank" HREF='foo=123'/>).

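import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# `html` is assumed to hold the page source as a string.
soup = BeautifulSoup(html)
# Capture the digits that follow the last '=' in the href.
p = re.compile(r'^.*=(\d+)$')
# href=True skips anchors that have no href attribute.
for a in soup.findAll('a', href=True):
    m = p.match(a["href"])
    if m:
        print m.group(1)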

Robert Rossney 2009-12-14 18:15:11
