I'm trying to parse a bit of HTML and extract the link whose href matches a particular pattern. I'm using the find method with a regular expression, but it doesn't return the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?

from BeautifulSoup import BeautifulSoup
import re

html = """
<div class="entry">
    <a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
    <a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> &ndash;
    <a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> &ndash;
</div>
"""

soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']

I should be getting the third link (the IMDB title page), but BS always returns the first link. The href of the first link doesn't even match my regex, so why is it returned?

Thanks.
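
To check the claim that only one href matches the pattern, the regex can be tested on its own, without BeautifulSoup. This is a minimal sketch using just the standard `re` module; the names `hrefs` and `matching` are illustrative and not part of the original snippet.

```python
import re

# The same pattern and hrefs as in the question.
pattern = re.compile(r".*title/tt.*")
hrefs = [
    "http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/",
    "http://www.imdb.com/video/imdb/vi2496267289/",
    "http://www.imdb.com/title/tt1196141/",
]

# Keep only the hrefs that the pattern matches.
matching = [h for h in hrefs if pattern.search(h)]
print(matching)  # ['http://www.imdb.com/title/tt1196141/']
```

Only the IMDB title link survives, so the pattern itself is not what selects the wrong link.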

A: 

Can't answer your question, but anyway your (originally) posted code has an import typo. Change

import BeautifulSoup

to

from BeautifulSoup import BeautifulSoup

Then your output (using BeautifulSoup version 3.1.0.1) will be:

http://www.imdb.com/title/tt1196141/
The MYYN
My bad. When testing it on my computer, I had BS in a different location, and when I copy-pasted the code here I hurriedly modified the `import`, hence the typo. I'll make the edit. The problem still persists, though: it doesn't give me the correct link.
Mridang Agarwalla
A: 

find returns only the first <a> tag that matches. If you want every matching tag, use findAll.
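
The difference can be sketched as follows. This assumes BeautifulSoup 4 (the `bs4` package) is installed, where `find_all` is the current name for `findAll`. The `imdb\.com` pattern is a hypothetical one chosen so that two links match; it is not the pattern from the question.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BS3 used in the question

html = """
<div class="entry">
    <a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
    <a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a>
    <a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical pattern that matches both IMDB links.
imdb = re.compile(r"imdb\.com")

# find() stops at the first <a> whose href matches the pattern.
first = soup.find("a", href=imdb)

# find_all() returns every <a> whose href matches the pattern.
every = soup.find_all("a", href=imdb)

print(first["href"])
print([a["href"] for a in every])
```

With the question's own pattern, which matches a single href, both calls would select the same link; the two only differ when several tags match.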

katrielalex