ansaurus

Question

Decomposing HTML to link text and target

Answer 1

+6 A:

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.

EDIT:

I think you want:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL inline like that: if the request fails, the error is raised in the middle of that one expression and is awkward to handle.

EDIT 2:

This should show you all the links in a page:

import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # This <a> tag has no href attribute
        pass
    else:
        print link

Harley 2008-11-13 00:40:29
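For readers on Python 3, where the urlparse module became urllib.parse, here is a minimal sketch of what happens to each href in the loop above. The page URL and href values are made up for illustration, and the urljoin step is an addition that is not in the answer; it is shown because many hrefs are relative.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical page URL and href values, for illustration only.
base = "http://www.example.com/index.html"
hrefs = ["about.html", "/docs/?page=2#top", "http://other.example.org/x"]

for href in hrefs:
    absolute = urljoin(base, href)   # resolve relative links against the page URL
    parts = urlparse(absolute)       # split into scheme, netloc, path, query, fragment
    print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
```

The result of urlparse is a named tuple, so the individual components can be read by name rather than by position.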

I agree, Beautiful Soup is probably the better way to handle this.

monkut 2008-11-13 00:44:28

Would it be better to open the URL elsewhere and check for errors there?

sundeep 2008-11-13 01:30:28

Yes, and have a try...except around it just in case it fails.

Harley 2008-11-13 01:36:30
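One way to follow that advice, sketched in Python 3 (where urllib.urlopen became urllib.request.urlopen). The helper name fetch_html, the 10-second timeout, and the UTF-8 decoding are assumptions made for this example, not something from the thread.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the URL cannot be opened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # URLError: unreachable host, HTTP error status, unknown scheme.
        # ValueError: the string is not a URL at all.
        # OSError: lower-level socket failures such as timeouts.
        print("could not open %s: %s" % (url, exc))
        return None
    # Assumption: treat the body as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

The caller can then check for None before handing the text to the parser, so a failed download never reaches the parsing step.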

Also, what does the u'text' mean? Thanks for the help.

sundeep 2008-11-13 01:39:20

The 'u' before the string means it is a Unicode string. See Wikipedia for what that means. It shouldn't affect you too much.

Harley 2008-11-13 01:44:39
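As a small illustration of the prefix: in Python 2, which this thread uses, u'text' is a unicode object and plain 'text' is a byte string. The sketch below is written for Python 3, where every str is already Unicode and the prefix is accepted but redundant; the sample string is made up.

```python
# Python 3: the u prefix is allowed for compatibility and changes nothing.
plain = "caf\u00e9"
prefixed = u"caf\u00e9"
print(plain == prefixed)

# Encoding turns text into bytes; decoding turns bytes back into text.
encoded = plain.encode("utf-8")
print(type(encoded).__name__, len(plain), len(encoded))
print(encoded.decode("utf-8") == plain)
```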

Answer 2

+1 A:

Using regular expressions to parse HTML or XML is a bad idea; it is far too difficult to do reliably. You'd be much better off using a library designed for that purpose, such as Beautiful Soup, as Harley suggested.

Jeremy Banks 2008-11-13 00:41:50
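To illustrate why this is hard to do reliably, the sketch below runs one plausible first-attempt pattern over three links that a browser would treat the same way. The pattern and the sample markup are made up for this example and are not taken from the question.

```python
import re

# A plausible first attempt: lowercase <a>, href first, double quotes.
naive = re.compile(r'<a\s+href="([^"]*)">(.*?)</a>')

samples = [
    '<a href="page.html">one</a>',                # matches
    "<a href='page.html'>two</a>",                # single quotes: missed
    '<A class="nav" HREF="page.html">three</A>',  # extra attribute, uppercase: missed
]

for html in samples:
    print(naive.findall(html))
```

Each miss can be patched with a more elaborate pattern, but the list of variations (unquoted values, attributes in any order, comments, line breaks) keeps growing.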

Answer 3

+4 A:

Here's a code example that gets the attributes and contents of the links:

import urllib
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents

Jerub 2008-11-13 00:48:43
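If installing Beautiful Soup is not an option, a rough Python 3 equivalent can be sketched with the standard library's html.parser. This swaps in a different parser from the one used above; the class name LinkCollector and the sample markup are hypothetical, and nested a tags are not handled.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")  # None when href is absent
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed(
    '<p><a href="url" class="x">text<span>something</span></a> '
    '<a name="top">skip</a></p>'
)
print(collector.links)
```

Text inside nested tags such as the span is kept as part of the link text, and the named anchor without an href is skipped.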

Answer 4

+1 A:

The others might be correct in pointing you to Beautiful Soup, but they might not be: an external library could be massively over the top for your purposes. Here's a regex which will do what you ask.

/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/

Here's what it matches:

'<a href="url" close="true">text</a>'
// Parts: "url", "text"

'<a href="url" close="true">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"

If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.

nickf 2008-11-13 00:51:54
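In Python the pattern above loses its surrounding slashes and goes into re.compile. Here is a minimal sketch that also includes the second, tag-stripping regex; the names LINK_RE, TAG_RE, and extract_links are made up for this example, and it inherits the pattern's limits (double-quoted href only).

```python
import re

LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>')
TAG_RE = re.compile(r'<[^>]*>')  # anything between angle brackets


def extract_links(html):
    """Return (href, text) pairs, with markup inside the link text removed."""
    return [(href, TAG_RE.sub("", inner)) for href, inner in LINK_RE.findall(html)]


print(extract_links('<a href="url" close="true">text</a>'))
print(extract_links('<a href="url" close="true">text<span>something</span></a>'))
```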

With this approach you need to watch out for line breaks in the source code. Make sure you set the flag re.DOTALL when you compile your pattern.

tgray 2009-08-24 13:32:06
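A short sketch of the difference the flag makes, using the same pattern on a made-up link whose text spans two lines:

```python
import re

pattern = r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>'
html = '<a href="url">first line\nsecond line</a>'

# Without the flag, "." stops at the newline and the link is missed.
without_flag = re.search(pattern, html)
print(without_flag)

# With re.DOTALL, "." also matches newlines.
with_flag = re.search(pattern, html, re.DOTALL)
print(with_flag.groups())
```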

Answer 5

+4 A:

Looks like you have two issues there:

It's link.contents (with an s), not link.content.
attrs is not a string. It holds the key/value pairs for each attribute of an HTML element (as a list of (name, value) tuples in Beautiful Soup 3, as a dictionary in Beautiful Soup 4). link['href'] will get you what you appear to be looking for in either version, but you'd want to wrap that in a check in case you come across an a tag without an href attribute.

Tom 2008-11-13 01:23:56
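The check for a missing href can be sketched without any library. The helper name href_of and the sample attribute sets are hypothetical; the helper accepts attributes either as a dictionary or as a list of (name, value) pairs.

```python
def href_of(attrs):
    """Return the href value, or None when the tag has no href attribute."""
    return dict(attrs).get("href")


# Made-up attribute sets for three <a> tags.
tags = [
    {"href": "index.html", "class": "nav"},
    [("href", "about.html")],
    [("name", "top")],               # a named anchor, no href
]

for attrs in tags:
    target = href_of(attrs)
    if target is None:
        continue                     # skip tags that are not links
    print(target)
```

On a Beautiful Soup tag, link.get('href') plays the same role: it returns None instead of raising when the attribute is absent.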

Yes, it was the contents issue... I'm a dumbass. Thanks!

sundeep 2008-11-13 01:26:23
