Hello,
I'm using BeautifulSoup to do some screen-scraping. My problem is this: I need to extract specific things out of a paragraph. An example:
<p><b><a href="/name/abe">ABE</a></b> <font class="masc">m</font> <font class="info"><a href="/nmc/eng.php" class="usg">English</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>
Out of this paragraph, I'm able to extract the name ABE as follows:
for pFound in soup.findAll('p'):
    print pFound
    # will get the names
    x = pFound.find('a').renderContents()
    print x
Now I also need to extract the other name in the same paragraph:
Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>
I need to extract this name only if the <a> tag is preceded by the text "Short form of".
Any ideas on how to do this? There are many such paragraphs in the HTML page, and not all of them have the text "Short form of". Some contain other text in that place.
I think some combination of a regex and findNext() may be useful, but I'm not familiar with BeautifulSoup and have already spent quite a lot of time on this.
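To make the idea concrete, here is a rough sketch of the kind of thing I have in mind. It is only a sketch under some assumptions: it uses the newer bs4 package on Python 3 (find_all, get_text, next_sibling) rather than the older findAll/renderContents calls above, the helper name short_form_target is made up for illustration, and the second sample paragraph is invented to stand in for a paragraph without the "Short form of" text.

```python
import re
from bs4 import BeautifulSoup

# Sample input: the first paragraph is the one from the question;
# the second is a made-up paragraph without the "Short form of" text.
html = """<p><b><a href="/name/abe">ABE</a></b> <font class="masc">m</font> <font class="info"><a href="/nmc/eng.php" class="usg">English</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />Short form of <a href="/name/abraham" class="nl">ABRAHAM</a></p>
<p><b><a href="/name/abel">ABEL</a></b> <font class="masc">m</font><br />Some other text <a href="/name/other" class="nl">OTHER</a></p>"""

soup = BeautifulSoup(html, "html.parser")

# Matches a text node that ends with "Short form of" (ignoring trailing spaces).
marker = re.compile(r"Short form of\s*$")


def short_form_target(p):
    """Hypothetical helper: name linked right after "Short form of" in p, or None."""
    text_node = p.find(string=marker)
    if text_node is None:
        return None
    # Only accept an <a> tag that directly follows the matching text.
    following = text_node.next_sibling
    if getattr(following, "name", None) == "a":
        return following.get_text()
    return None


results = []
for p_found in soup.find_all("p"):
    name = p_found.find("a").get_text()  # first link in the paragraph
    results.append((name, short_form_target(p_found)))

print(results)
```

On the sample above this should pair ABE with ABRAHAM and give None for the second paragraph. Checking next_sibling rather than searching further ahead is a deliberate choice here, so that a link appearing later in the paragraph is not picked up by mistake. Whether that is too strict for the real page is something I can't tell yet.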
Any help would be appreciated. Thanks.