BeautifulSoup newbie... Need help

Here is the code sample...

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mec = Browser()
#url1 = "http://www.wines.com/catalog/index.php?cPath=21"
url2 = "http://www.wines.com/catalog/product_info.php?products_id=4866"
page = mec.open(url2)
html = page.read()
soup = BeautifulSoup(html)

print soup.prettify()

When I use url1 I get a nice dump of the page. When I use url2 (the one I need), I get output without the body:

<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html dir="LTR" lang="en">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>
   2005 Jordan Cabernet Sauvignon Sonoma 2005
  </title>
 </head>
</html>

Any ideas?

+2  A: 

Yes. The HTML is bad.

Step 1a. print soup.prettify() and see where it stops indenting correctly.

Step 1b (if 1a doesn't work). Run the raw HTML through any HTML prettifier. I use BBEdit for things that confuse Beautiful Soup.

Look closely at the HTML. There will be some kind of horrible error. Misplaced " characters are common. Also, a CSS background-image given as a style attribute often has bad quotes.

<tag style="background-image:url("something")">

Note the "improper" quotes. You'll need to write a regex to find and fix these.

Step 2. Write a "massage" regular expression and function to fix this. Read the http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps section for how to create a markup massage.

Here's what I often use

import re

# Fix background-image:url("some URI")
# by replacing the inner quotes with &quot;
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image(match):
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)

# Fix <img src="some URI name="someString"">  -- note the out-of-place quotes
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')
def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

# (regex, replacement function) pairs to use as a markup massage
fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]
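
To see what such a list does to the markup, here is a minimal sketch that applies the pairs by hand with plain `re`, so it runs without Beautiful Soup. The `massage` helper and the `sample` string are made up for illustration; they are not part of Beautiful Soup or of the page in question.

```python
import re

# Same two patterns as in the answer, using the standard &quot; entity.
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')

def fix_background_image(match):
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)

def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]

def massage(html, fixes):
    """Hypothetical helper: apply each (regex, function) pair in order."""
    for pattern, repair in fixes:
        html = pattern.sub(repair, html)
    return html

# Hypothetical input, not taken from the real page.
sample = ('<div style="background-image:url("bg.png")">'
          '<img src="a.png name="logo""></div>')
cleaned = massage(sample, fix_style_quotes)
print(cleaned)
```

According to the documentation section linked above, Beautiful Soup 3 takes a list of this shape through the markupMassage constructor argument, so the same pairs should be usable there once the patterns match your page.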
S.Lott
Stops having any contents right after head...!-)
Alex Martelli
A: 

Running a validator on the HTML in question shows 116 errors -- just too many to track down which one BeautifulSoup is failing to recover from, I guess :-(

html5lib seems to survive the ordeal of parsing this horror page, and leaves a lot of stuff in (the prettify has just about all of the original page, it seems to me, when you use html5lib's parser to produce a BeautifulSoup object). Hard to say if the resulting parse tree will do what you need, since we don't really know what that is;-).

Note: I've installed html5lib right from the hg clone sources (just python setup.py install from the html5lib/python directory), since the last official release is a bit long in the tooth.

Alex Martelli
html5lib seems to be the route that BeautifulSoup's creator would like us to pursue. He does say "... at the moment it's slower than either SGMLParser or HTMLParser", but nevertheless recommends it as an alternative given that he "no longer enjoy[s] working on Beautiful Soup, but too many people depend on it for me to let the project die just because it depends on code that's been removed from the Python standard library": http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
Jarret Hardie
+1  A: 

The HTML is indeed horrible :-) BeautifulSoup 3.0.7 is much better at handling malformed HTML than the current version. The project website warns: "Currently the 3.0.x series is better at parsing bad HTML than the 3.1 series."... and there's a great page devoted to the reason why, which boils down to the fact that SGMLParser was removed in Python 3, and BS 3.1.x was written to be convertible to Py3k.

The good news is that you can still download 3.0.7a (the last version), which on my machine parses the URL you mentioned perfectly: http://www.crummy.com/software/BeautifulSoup/download/3.x/

Jarret Hardie
+2  A: 

It seems to be getting tripped up by this bad tag:

<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant, cherry and menthol on the nose, with subtle vanilla, cola and tobacco notes adding complexity. Tightly wound red berry and bitter cherry flavors are framed by dusty...">

Clearly here they have failed to escape a quote inside the attribute value (uh-oh... site might be vulnerable to cross-site scripting?), and that's making the parser think the rest of the content of the page is all in attribute values. (It would take another " or a > inside one of the real attribute values to make it realise the mistake, I think.)
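
A crude way to hunt for this kind of damage on other pages is to count the double quotes inside each tag: a tag with an odd number is suspicious. The sketch below is only a heuristic under that assumption. The function name is invented, the sample is an abbreviated copy of the tag above, and a `>` inside an attribute value, a comment, or a script would confuse it.

```python
import re

# Anything that looks like a tag: "<", no angle brackets inside, then ">".
TAG = re.compile(r'<[^<>]*>')

def tags_with_odd_quotes(html):
    """Return tags that contain an odd number of double quotes."""
    return [tag for tag in TAG.findall(html) if tag.count('"') % 2 == 1]

# Abbreviated sample; the description text is shortened for illustration.
sample = (
    '<title>Wine</title>\n'
    '<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant...">\n'
    '<body><p class="note">ok</p></body>'
)
suspects = tags_with_odd_quotes(sample)
for tag in suspects:
    print(tag)
```

Note that damage which adds quotes in pairs, such as the background-image case mentioned in another answer, keeps the count even, so this check would miss it.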

Unfortunately this is quite a tricky error to fix up after the fact. You could try a slightly different parser, perhaps? E.g. try Beautiful Soup 3.0.x instead of 3.1.x if you're using that version, or vice versa. Or try html5lib.

bobince
@bobince, good spotting! I gave up on that mess of pottage too soon after managing to run html5lib on it, it seems -- didn't spot the very early error of the extra doublequote. +1 for hawk eyes!-)
Alex Martelli
That's why Beautiful Soup has a "Markup Massage" feature. You can provide an RE to spot this specific problem and repair the damaged quotes.
S.Lott
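
For this particular page, a repair along those lines might look like the sketch below. It assumes the only damage is an unescaped quote inside the description META tag and that the description text never contains `">` itself. The pattern and function names are invented for illustration, and the sample is an abbreviated copy of the tag quoted above.

```python
import re

# Capture the opening of the tag, the damaged value, and the real closing quote.
META_DESCRIPTION = re.compile(
    r'(<meta\s+name="description"\s+content=")(.*?)("\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)

def escape_inner_quotes(match):
    """Escape every double quote found inside the attribute value."""
    opening, value, closing = match.groups()
    return opening + value.replace('"', '&quot;') + closing

def fix_meta_description(html):
    return META_DESCRIPTION.sub(escape_inner_quotes, html)

# Abbreviated sample; the description text is shortened for illustration.
sample = '<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant...">'
fixed = fix_meta_description(sample)
print(fixed)
```

The pair `(META_DESCRIPTION, escape_inner_quotes)` has the same (regex, function) shape as the massage entries in S.Lott's answer, so it could presumably be added to such a list rather than called directly.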
To be honest I don't know how browsers' parsers are managing to cope with it!
bobince
@bobince: The browsers have huge, complex error fallbacks. Usually they skip tags they can't process. The HTML rules (sadly) are to keep trying and display something. IIRC most browsers will attempt to skip past damaged HTML tags and try to resume parsing at the next ">".
S.Lott