Given an HTML link like

<a href="urltxt" class="someclass" close="true">texttxt</a>

how can I isolate the url and the text?

Updates

I'm using Beautiful Soup, and am unable to figure out how to do that.

I did

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))

links = soup.findAll('a')

for link in links:
    print "link content:", link.content," and attr:",link.attrs

I get

link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]  ...
...

Why am I missing the content?

edit: elaborated on 'stuck' as advised :)

+6  A: 

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.

EDIT:

I think you want:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL inside that call: if the request fails, the exception is raised mid-expression, where it's awkward to handle.

EDIT 2:

This should show you all the links in a page:

import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

# findAll returns every <a> tag in the document
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # This <a> tag has no href attribute
        pass
    else:
        print link
Harley
I agree, Beautiful Soup is probably the better way to handle this.
monkut
Would it be better to open the URL elsewhere and check for errors there?
sundeep
Yes, and have a try...except around it just in case it fails.
Harley
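A minimal sketch of that advice, assuming Python 3's standard library rather than the Python 2 urllib used in the answer above. The helper name fetch_source and the 10-second timeout are hypothetical choices for illustration:

```python
import urllib.request


def fetch_source(url, timeout=10):
    """Return the page body as bytes, or None if the URL cannot be opened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs such as a missing scheme.
        print("could not open %s: %s" % (url, exc))
        return None
```

The parser would then only be called when the returned value is not None, which keeps the network failure handling separate from the parsing.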
Also, what does the u'text' mean? Thanks for the help.
sundeep
The 'u' before the string means it is a Unicode string. See Wikipedia for what that means. It shouldn't affect you too much.
Harley
+1  A: 

Using regular expressions to parse HTML (or XML) is a bad idea; it is far too difficult to do reliably. You'd be much better off using a library designed for that purpose, such as Beautiful Soup, as Harley suggested.

Jeremy Banks
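As a rough illustration of the parser-based approach, here is a sketch that uses Python 3's built-in html.parser as a stand-in for Beautiful Soup (a swapped-in library, not the one recommended above). The class name LinkCollector is hypothetical:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open, if any
        self._in_link = False  # True while inside an <a> element
        self._text = []        # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")  # None when href is missing
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text)))
            self._in_link = False


collector = LinkCollector()
collector.feed('<a href="urltxt" class="someclass" close="true">texttxt</a>')
print(collector.links)  # [('urltxt', 'texttxt')]
```

Text inside nested tags is collected too, because every data fragment seen while the link is open gets appended.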
+4  A: 

Here's a code example, showing getting the attributes and contents of the links:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
Jerub
+1  A: 

The others may be right to point you to Beautiful Soup, but an external library might be massively over the top for your purposes. Here's a regex that will do what you ask.

/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/

Here's what it matches:

'<a href="url" close="true">text</a>'
// Parts: "url", "text"

'<a href="url" close="true">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"

If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.

nickf
With this approach you need to watch out for line breaks in the source code. Make sure you set the flag re.DOTALL when you compile your pattern.
tgray
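A sketch of how the regex answer and the re.DOTALL advice might look together in Python 3. The pattern is the one from the answer with the JavaScript-style slashes dropped. The names LINK_RE, TAG_RE and extract_links are hypothetical, and the tag-stripping pattern is only one simple way to do the second pass:

```python
import re

# re.DOTALL lets "." match newlines, so links that span several lines still match.
LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>', re.DOTALL)
# Second pass: anything between angle brackets is treated as a tag and removed.
TAG_RE = re.compile(r'<[^>]+>')


def extract_links(html):
    """Return (url, text) pairs; the text has any nested tags stripped."""
    return [(url, TAG_RE.sub('', inner)) for url, inner in LINK_RE.findall(html)]


sample = '<a href="url"\n   close="true">text<span>something</span></a>'
print(extract_links(sample))  # [('url', 'textsomething')]
```

Like the original pattern, this only handles double-quoted href values and will miss other valid ways of writing a link.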
+4  A: 

Looks like you have two issues there:

  1. link.content**s**, not link.content
  2. attrs holds the key-value pairs for each attribute of an HTML element; it is not a string. In Beautiful Soup 3 it is the list of (name, value) tuples shown in your output (Beautiful Soup 4 makes it a dictionary), so link['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute.
Tom
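The missing-href check can be sketched without Beautiful Soup installed, assuming the attributes arrive as the list of (name, value) pairs seen in the question's output. The helper name get_href and the sample attribute lists are hypothetical:

```python
def get_href(attrs):
    """Return the href value from (name, value) pairs, or None if it is absent."""
    # dict() accepts both a list of pairs and an existing dictionary.
    return dict(attrs).get('href')


with_href = [('href', 'urltxt'), ('class', 'someclass')]
without_href = [('name', 'top')]

print(get_href(with_href))     # urltxt
print(get_href(without_href))  # None
```

Returning None instead of raising lets the calling loop skip anchors that have no href.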
Yes, it was the content*s* issue... I'm a dumbass. Thanks!
sundeep