Hey,

Here's a piece of HTML code (from delicious):

<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers &amp; Anti-Bot Protection</a>
<span class="saverem">
  <em class="bookmark-actions">
    <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&amp;jump=%2Fdux&amp;key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&amp;original_user=dux&amp;copyuser=dux&amp;copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
  </em>
</span>
</h4>

I'm trying to find all the links where class="inlinesave action". Here's the code:

import urllib2
from BeautifulSoup import BeautifulSoup

sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)

But it doesn't find anything!

Any thoughts?

Thanks

+1  A: 

If you want to look for an anchor with exactly those two classes, you'd have to use a regexp, I think:

import re
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})

Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").

The following statement should work for either ordering of the two class names (even though it looks ugly, IMO):

soup.findAll('a', 
    attrs={'class': 
        re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
    })
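
As a rough illustration of where the regex approach can go wrong, the sketch below runs the same pattern against a few made-up class strings and compares it with a whitespace-split check. It uses only the standard library and Python 3 syntax (unlike the Python 2 snippets in this thread); has_classes and the sample strings are hypothetical and not part of BeautifulSoup.

```python
import re

# Same alternation as in the statement above.
BOTH = re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')

def has_classes(value, required=("inlinesave", "action")):
    """Hypothetical helper: True if every required class is a whole token."""
    return set(required) <= set((value or "").split())

# Made-up class attribute values for comparison.
samples = [
    "inlinesave action",
    "action inlinesave",
    "inlinesave  extra action",
    "inlinesave",
    "inlinesave my-action",
]

for s in samples:
    # Columns: class string, regex result, token-set result.
    print(repr(s), bool(BOTH.search(s)), has_classes(s))
```

The last sample is the interesting one: \b treats the hyphen as a word boundary, so the regex accepts "my-action" while the token check does not. BeautifulSoup 3 appears to accept a callable as an attribute matcher, so a helper like this could presumably be passed as attrs={'class': has_classes}, though that is untested here.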
Haes
It only works if the regex matches exactly, so this isn't the way I'd go, personally (what if there's an extra space between the classes, what if there's another class between them etc). Could probably tighten it up a bit to match all likely cases though.
Dominic Rodger
You're right, I edited the answer accordingly.
Haes
This is described as a bug in https://bugs.launchpad.net/beautifulsoup/+bug/410304. Maybe we can have a fix in the future?
GmonC
A: 

You can do this with plain Python string methods:

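html = open("file").read()
# Each chunk after a "<strong>" may contain one SAVE link.
for item in html.split("<strong>"):
    if "class" in item and "inlinesave action" in item:
        # Keep only the text between href=" and the closing ">.
        url_with_junk = item.split('href="')[1]
        m = url_with_junk.index('">')
        print url_with_junk[:m]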
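
If string splitting feels too brittle, a similar extraction can be sketched with the standard library's HTML parser. This is only a sketch in Python 3 (the snippets in this thread are Python 2); SaveLinkParser and the shortened sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class SaveLinkParser(HTMLParser):
    """Hypothetical parser: collects the href of <a> tags carrying both classes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Split on whitespace so class order and extra classes do not matter.
        classes = (attrs.get("class") or "").split()
        if "inlinesave" in classes and "action" in classes:
            self.hrefs.append(attrs.get("href"))

# Shortened, made-up version of the markup from the question.
sample = '''<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/">Title</a>
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;jump=%2Fdux">SAVE</a></strong>
</h4>'''

parser = SaveLinkParser()
parser.feed(sample)
print(parser.hrefs)
```

Note that the parser unescapes entities in attribute values, so the collected href contains a plain & where the markup has &amp;.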
A: 

Maybe that issue is fixed in version 3.1.0; I could parse your sample:

>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
...   <em class="bookmark-actions">
...     <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Gen
...   </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Generate%20Secure%
>>>

I also tried with BeautifulSoup 2.1.1; it does not work at all.

S.Mark
A: 

You might make some forward progress using pyparsing:

from pyparsing import makeHTMLTags, withAttribute

htmlsrc="""<h4>... etc."""

atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))

for result in atag.searchString(htmlsrc):
    print result.href

Gives (long result output snipped at '...'):

/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Genera...+anonymous+links
Paul McGuire