ansaurus

Question

What is the best way to handle a bad link given to BeautifulSoup?

Answer 1

+3 A:

I simply wrap my BeautifulSoup processing in a try block and catch the HTMLParser.HTMLParseError exception:

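import HTMLParser
import BeautifulSoup

# raw_html is the page body and url is where it came from
try:
    soup = BeautifulSoup.BeautifulSoup(raw_html)
    for a in soup.findAll('a'):
        href = a['href']
        # .... (rest of the per-link processing)
except HTMLParser.HTMLParseError:
    # The markup was too broken to parse; report it and move on
    print "failed to parse", url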

Beyond that, you can check the Content-Type of each response when you crawl a page and make sure it is something like text/html or application/xhtml+xml before you even try to parse it. That should head off most errors.
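
A minimal sketch of that content-type check, written for Python 3 with only the standard library (unlike the Python 2 snippet above). The names PARSEABLE_TYPES, is_parseable and fetch_html are hypothetical, and the whitelist is an assumption you would adjust for your own crawler:

```python
from urllib.request import urlopen

# Assumed whitelist of media types worth handing to the parser.
PARSEABLE_TYPES = {"text/html", "application/xhtml+xml"}


def is_parseable(content_type_header):
    """Return True if the Content-Type header names a whitelisted media type."""
    if not content_type_header:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PARSEABLE_TYPES


def fetch_html(url):
    """Return the response body, or None if the content type is not parseable."""
    with urlopen(url) as response:
        if not is_parseable(response.headers.get("Content-Type")):
            return None
        return response.read()
```

With this in place, anything fetch_html returns as None can be skipped before BeautifulSoup ever sees it, and the try/except stays as a second line of defence for pages that claim to be HTML but are not.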

Jehiah 2009-01-17 06:20:54
