ansaurus

Question

Answer 1

+2 A:

Yes. The HTML is bad.

Step 1a. Print soup.prettify() and see where it stops indenting correctly.

Step 1b (if 1a doesn't work). Run the raw HTML through any HTML prettifier. I use BBEdit for things that confuse Beautiful Soup.

Look closely at the HTML. There will be some kind of horrible error. Misplaced " characters are common. Also, a CSS background-image given in a style attribute often has bad quotes:

<tag style="background-image:url("something")">

Note the "improper" quotes. You'll need to write a regex to find and fix these.

Step 2. Write a "massage" regular expression and function to fix this. Read the http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps section for how to create a markup massage.

Here's what I often use:

import re

# Fix background-image:url("some URI")
# by replacing the inner quotes with &quot;
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image(match):
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)

# Fix <img src="some URI name="someString"">  -- note the out-of-place quotes
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')
def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

# (compiled pattern, replacement function) pairs, the shape a markup massage expects
fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]
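
To check what a massage list like this does before handing it to the parser, the (pattern, function) pairs can be applied by hand. The sketch below repeats the two patterns so that it runs on its own and uses the standard &quot; entity. The apply_massage helper and the sample markup are made up for illustration and are not part of Beautiful Soup.

```python
import re

# Pattern for background-image:url("some URI") inside a style attribute.
background_image = re.compile(r'background-image:url\("([^"]+)"\)')

def fix_background_image(match):
    # Replace the inner double quotes with the &quot; entity.
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)

# Pattern for <img src="some URI name="someString""> with out-of-place quotes.
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')

def fix_bad_img(match):
    # Close the src value and drop the stray trailing quote.
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]

def apply_massage(markup, massage):
    """Hypothetical helper: run each (pattern, function) pair over the markup in order."""
    for pattern, replacement in massage:
        markup = pattern.sub(replacement, markup)
    return markup

# Made-up sample containing both kinds of damage.
sample = '<div style="background-image:url("bg.png")"><img src="a.png name="logo""></div>'
print(apply_massage(sample, fix_style_quotes))
```

With Beautiful Soup 3 itself, the documentation section linked above passes such a list to the constructor through the markupMassage argument, as in BeautifulSoup(markup, markupMassage=fix_style_quotes).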

S.Lott 2009-11-09 01:38:46

Stops having any contents right after head...!-)

Alex Martelli 2009-11-09 01:44:35

Answer 2

A:

Running a validator on the HTML in question shows 116 errors -- just too many to track down which one BeautifulSoup is proving unable to recover from, I guess :-(

html5lib seems to survive the ordeal of parsing this horror page, and leaves a lot of stuff in (the prettify has just about all of the original page, it seems to me, when you use html5lib's parser to produce a BeautifulSoup object). Hard to say if the resulting parse tree will do what you need, since we don't really know what that is;-).

Note: I've installed html5lib right from the hg clone sources (just python setup.py install from the html5lib/python directory), since the last official release is a bit long in the tooth.

Alex Martelli 2009-11-09 01:46:32

html5lib seems to be the route that BeautifulSoup's creator would like us to pursue. He does say "... at the moment it's slower than either SGMLParser or HTMLParser", but nevertheless recommends it as an alternative given that he "no longer enjoy[s] working on Beautiful Soup, but too many people depend on it for me to let the project die just because it depends on code that's been removed from the Python standard library": http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

Jarret Hardie 2009-11-09 01:50:35

Answer 3

+1 A:

The HTML is indeed horrible :-) BeautifulSoup 3.0.7 is much better at handling malformed HTML than the current 3.1.x series. The project website warns: "Currently the 3.0.x series is better at parsing bad HTML than the 3.1 series." There's also a great page devoted to the reason why, which boils down to the fact that SGMLParser was removed in Python 3, and BS 3.1.x was written to be convertible to Py3k.

The good news is that you can still download 3.0.7a (the latest 3.0.x release), which on my machine parses the URL you mentioned perfectly: http://www.crummy.com/software/BeautifulSoup/download/3.x/

Jarret Hardie 2009-11-09 01:46:46

Answer 4

+2 A:

It seems to be getting tripped up by this bad tag:

<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant, cherry and menthol on the nose, with subtle vanilla, cola and tobacco notes adding complexity. Tightly wound red berry and bitter cherry flavors are framed by dusty...">

Clearly here they have failed to escape a quote inside the attribute value (uh-oh... site might be vulnerable to cross-site scripting?), and that's making the parser think the rest of the content of the page is all in attribute values. (It would take another " or a > inside one of the real attribute values to make it realise the mistake, I think.)

Unfortunately this is quite a tricky error to recover from after the fact. You could try a slightly different parser, perhaps? E.g. try Beautiful Soup 3.0.x instead of 3.1.x if you're using that version, or vice versa. Or try html5lib.

bobince 2009-11-09 01:47:37

@bobince, good spotting! I gave up on that mess of pottage too soon after managing to run html5lib on it, it seems -- didn't spot the very early error of the extra doublequote. +1 for hawk eyes!-)

Alex Martelli 2009-11-09 02:20:44

That's why Beautiful Soup has a "Markup Massage" feature. You can provide an RE to spot this specific problem and repair the damaged quotes.
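
A minimal sketch of such an RE for the META tag quoted above. It assumes that the damaged attribute is CONTENT, that CONTENT is the last attribute in the tag, and that the attribute text never contains "> itself. The names meta_content and fix_meta_content are made up for illustration.

```python
import re

# Group 1: the tag up to the opening quote of CONTENT.
# Group 2: the attribute text, which may contain stray double quotes.
# Group 3: the closing quote and the end of the tag.
# The lazy (.*?) stops at the first '">' it can reach.
meta_content = re.compile(
    r'(<meta\b[^>]*?\bcontent=")(.*?)("\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)

def fix_meta_content(match):
    # Escape every double quote inside the attribute text.
    escaped = match.group(2).replace('"', '&quot;')
    return match.group(1) + escaped + match.group(3)

# Shortened version of the damaged tag from the page.
bad = '<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant...">'
print(meta_content.sub(fix_meta_content, bad))
```

The pair (meta_content, fix_meta_content) has the same (pattern, function) shape as the other massage entries, so it could go into the same list.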

S.Lott 2009-11-09 03:20:21

To be honest I don't know how browsers' parsers are managing to cope with it!

bobince 2009-11-09 09:51:25

@bobince: The browsers have huge, complex error fallbacks. Usually they skip tags they can't process. The HTML rules (sadly) are to keep trying and display something. IIRC most browsers will attempt to skip past damaged HTML tags and try to resume parsing at the next ">".

S.Lott 2009-11-09 11:07:41

BeautifulSoup is omitting body of page
