I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library subscribes to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass a page to BeautifulSoup, it truncates its tree just before the end of the entry for the first resource in the list. The problem seems to be in the image link used to add the resource to a search set. This is where things get cut off. Here's the HTML:

<a href="http://www2.lib.myschool.edu:7017/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45899?func=find-db-add-res&amp;resource=XYZ00618&amp;z122_key=000000000&amp;function-in=www_v_find_db_0" onclick='javascript:addToz122("XYZ00618","000000000","myImageXYZ00618","http://discover.lib.myschool.edu:8331/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45900");return false;'>
    <img name="myImageXYZ00618" id="myImageXYZ00618" src="http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png" title="Add to My Sets" alt="Add to My Sets" border="0">
</a>

And here is my Python code:

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all")
print BeautifulSoup(page).prettify()

In BeautifulSoup's version, the opening <a href...> shows up, but the <img> doesn't, and the <a> is immediately closed, as are the rest of the open tags, all the way to </html>.

The only distinguishing trait I see for these "add to sets" images is that they are the only ones to have name and id attributes. I can't see why that would cause BeautifulSoup to stop parsing immediately, though.

Note: I am almost entirely new to Python, but seem to be understanding it all right.

Thank you for your help!

A: 

If I remember correctly, BeautifulSoup uses "name" in its tree as the name of the tag. In this case "a" would be the "name" of the anchor tag.

That doesn't seem like it should break it though. What version of Python and BS are you using?

Trey Stout
+3  A: 

I was using Firefox's "view selection source", which apparently cleans up the HTML for me. When I viewed the original source, this is what I saw:

<img name="myImageXYZ00618" id="myImageXYZ00618" src='http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png' alt='Add to My Sets' title='Add to My Sets' border="0"title="Add to clipboard PAIS International (CSA)" alt="Add to clipboard PAIS International (CSA)">

By putting a space after the border="0" attribute, I can get BS to parse the page.
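If the page cannot be fixed at the source, one possible workaround is to patch the markup before handing it to the parser. The following is only a rough sketch, not something from the original page: the helper name `add_missing_attribute_spaces` is hypothetical, and the regular expression is a crude heuristic that assumes attribute values are quoted. It inserts a space wherever a closing quote is immediately followed by what looks like the start of another attribute.

```python
import re

# A quote that is NOT preceded by "=" (so it closes a value rather than
# opening one) and is immediately followed by something shaped like
# name="..." or name='...'. This is a heuristic, not a real HTML tokenizer.
_MISSING_SPACE = re.compile(r'''(?<!=)(["'])(?=[A-Za-z][\w:-]*=["'])''')


def add_missing_attribute_spaces(html):
    """Insert a space between a closing quote and a following attribute name."""
    return _MISSING_SPACE.sub(r'\1 ', html)


# Shortened version of the problematic tag from the page source.
broken = '<img id="myImageXYZ00618" border="0"title="Add to clipboard">'
print(add_missing_attribute_spaces(broken))
```

The cleaned string can then be passed to BeautifulSoup in place of the raw page contents. Because the pattern also runs over text outside of tags, it may touch inline scripts that happen to contain similar sequences, so it is worth checking the result on a real page.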

Tim
+1  A: 

I strongly recommend using html5lib + lxml instead of BeautifulSoup. html5lib is a real HTML parser (very similar to the one in Firefox), and lxml provides a very flexible way to query the resulting tree (CSS selectors or XPath).

BeautifulSoup has plenty of bugs and strange behavior, which makes it a poor fit for HTML markup you can't trust.
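A minimal sketch of what that combination might look like, assuming both html5lib and lxml are installed. The function name `image_ids` and the shortened sample markup are illustrative only; the sample reuses the missing space after `border="0"` from the question.

```python
import html5lib  # third-party; the "lxml" treebuilder also needs lxml installed


def image_ids(markup):
    """Return the id attribute of every <img> element in the markup."""
    # treebuilder="lxml" asks html5lib to build an lxml tree, and
    # namespaceHTMLElements=False keeps tag names free of the XHTML
    # namespace so that a plain XPath expression such as //img matches.
    tree = html5lib.parse(markup, treebuilder="lxml", namespaceHTMLElements=False)
    return tree.xpath("//img/@id")


broken = '<a href="#"><img id="myImageXYZ00618" border="0"title="Add to My Sets"></a>'
print(image_ids(broken))
```

The same tree can be queried with any other XPath expression, for example to collect the `href` values of the surrounding links.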

Armin Ronacher
Thanks for the tip. I'll give it a try if BeautifulSoup gives me more trouble.
Tim