ansaurus

Question

Why is Beautiful Soup truncating this page?

Answer 1

A:

If I remember correctly, BeautifulSoup uses "name" in its tree as the name of the tag. In this case, "a" would be the "name" of the anchor tag.

That doesn't seem like it should break it though. What version of Python and BS are you using?

Trey Stout 2009-03-21 03:59:26

Answer 2

+3 A:

I was using Firefox's "view selection source", which apparently cleans up the HTML for me. When I viewed the original source, this is what I saw:

<img name="myImageXYZ00618" id="myImageXYZ00618" src='http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png' alt='Add to My Sets' title='Add to My Sets' border="0"title="Add to clipboard PAIS International (CSA)" alt="Add to clipboard PAIS International (CSA)">

By putting a space after the border="0" attribute, I can get BS to parse the page.
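If the page can't be edited by hand, the same fix could be applied to the raw HTML before handing it to the parser. This is only a rough sketch: the helper name `add_missing_attribute_spaces` and the regular expression are my own assumptions, not anything provided by BeautifulSoup, and a regex like this can misfire on quoted text inside scripts or attribute values.

```python
import re

# A quoted attribute value (after "=") that is immediately followed by a letter,
# i.e. the start of the next attribute with no whitespace in between.
_GLUED_ATTRIBUTE = re.compile(r"""(=\s*(?:"[^"]*"|'[^']*'))(?=[A-Za-z])""")


def add_missing_attribute_spaces(html):
    """Insert a space wherever a quoted value runs straight into the next attribute."""
    return _GLUED_ATTRIBUTE.sub(r"\1 ", html)


# Shortened version of the tag from the page; note border="0"title=...
broken = '<img border="0"title="Add to clipboard" alt="Add to clipboard">'
fixed = add_missing_attribute_spaces(broken)
print(fixed)
```

The repaired string can then be passed to the parser in place of the original page source.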

Tim 2009-03-21 06:34:15

Answer 3

+1 A:

I strongly recommend using html5lib + lxml instead of Beautiful Soup. html5lib uses a real HTML parser (very similar to the one in Firefox), and lxml provides a very flexible way to query the resulting tree (CSS selectors or XPath).

BeautifulSoup has many bugs and quirks, which makes it a poor choice for HTML markup you can't trust.
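For illustration, a minimal sketch of that combination, assuming both html5lib and lxml are installed. The sample markup is made up for the example and reuses the missing space after border="0"; the query uses XPath only, since CSS selectors in lxml need an extra package.

```python
import html5lib  # third-party; the "lxml" tree builder also needs lxml installed

# Made-up markup with the same problem: no space between border="0" and title=
html = '<p><img id="icon" src="a.png" border="0"title="Add to clipboard"></p>'

# Build an lxml tree with html5lib's parser. namespaceHTMLElements=False keeps
# tag names un-namespaced so plain XPath expressions like //img work.
tree = html5lib.parse(html, treebuilder="lxml", namespaceHTMLElements=False)

images = tree.xpath("//img")
for img in images:
    print(img.get("id"), img.get("src"), img.get("title"))
```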

Armin Ronacher 2009-03-23 19:19:48

Thanks for the tip. I'll give it a try if BeautifulSoup gives me more trouble.

Tim 2009-03-24 00:51:16
