Hello,

I'm using BeautifulSoup to do some screen-scraping. My problem is this: I need to extract specific things out of a paragraph. An example:

<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m</font> &nbsp; <font class="info"><a href="/nmc/eng.php" class="usg">English</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>

Out of this paragraph, I'm able to extract the name ABE as follows:

for pFound in soup.findAll('p'):
    print pFound
    # will get the names
    x = pFound.find('a').renderContents()
    print x

Now my problem is to extract the other name as well, in the same paragraph.

Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>

I need to extract this only if the <a> tag is preceded by the text "Short form of".

Any ideas on how to do this? There are many such paragraphs in the HTML page, and not all of them have the text "Short form of". They might contain some other text in that place.

I think that some combination of regex and findNext() may be useful, but I'm not familiar with BeautifulSoup, and I ended up wasting quite a lot of time on this.

Any help would be appreciated. Thanks.

A: 

The following should work:

htm = '''<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m
</font>&nbsp; <font class="info"><a href="/nmc/eng.php" class="usg">English
</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />
Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>'''

import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(htm)

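for p in soup.findAll('p'):
  firsta = True    # still waiting for the first <a> in this paragraph
  shortf = False   # becomes True once "Short form of" has been seen
  for c in p.recursiveChildGenerator():
    if isinstance(c, BeautifulSoup.NavigableString):
      # text node: check for the marker phrase
      if 'Short form of' in str(c):
        shortf = True
    elif c.name == 'a':
      # print only the first link, or a link that follows the marker phrase
      if firsta or shortf:
        print c.renderContents()
        firsta = shortf = False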
Alex Martelli
Thank you Alex, that worked with a few modifications.
@kartiku, you're welcome, and, glad to hear this!
Alex Martelli
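
For anyone reading this on Python 3 without BeautifulSoup 3 installed, here is a rough sketch of the same flag-based walk using only the standard library's html.parser. This is not the code from the answer above, just the same idea swapped onto a different parser. The class name NameExtractor, its attributes, and the second sample paragraph ("AL" / "Variant of") are made up for illustration.

```python
from html.parser import HTMLParser


class NameExtractor(HTMLParser):
    """Hypothetical helper: keep the first link text of each <p>, plus any
    link text that comes after the phrase 'Short form of'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.names = []        # link texts collected so far
        self.first_a = False   # True until the first <a> of the current <p>
        self.short_f = False   # True once 'Short form of' has been seen
        self.capture = False   # True while inside an <a> we want to keep
        self.buf = []          # text pieces of the link being captured

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # new paragraph: reset both flags
            self.first_a, self.short_f = True, False
        elif tag == "a" and (self.first_a or self.short_f):
            self.capture = True
            self.buf = []
            self.first_a = self.short_f = False

    def handle_data(self, data):
        if self.capture:
            self.buf.append(data)
        elif "Short form of" in data:
            self.short_f = True

    def handle_endtag(self, tag):
        if tag == "a" and self.capture:
            self.names.append("".join(self.buf).strip())
            self.capture = False


# Sample input: the paragraph from the question plus an invented one
# that uses different text in place of "Short form of".
sample = (
    '<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m</font>'
    ' &nbsp; <font class="info"><a href="/nmc/eng.php" class="usg">English</a>,'
    ' <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />'
    'Short form of <a href="/name/abraham" class="nl">ABRAHAM</a></p>'
    '<p><b><a href="/name/al">AL</a></b><br />'
    'Variant of <a href="/name/alan">ALAN</a></p>'
)

parser = NameExtractor()
parser.feed(sample)
print(parser.names)  # ['ABE', 'ABRAHAM', 'AL']
```

The second paragraph contributes only its first link, since its trailing link is not preceded by "Short form of". Real pages may need extra handling (for example, when the phrase is split across several text nodes).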
A: 

You can use pyparsing as a sort of "super-regex" for parsing through HTML. You can put together a simple matching pattern by assembling the various starting and ending tags, without tripping over the typical pitfalls of scraping HTML with regexes (unpredictable tag/attribute letter case, unpredictable attributes, attributes out of order, unpredictable whitespace). Then pattern.scanString returns a generator that scans through the HTML source and yields tuples of the matched tokens plus the starting and ending locations of each match. Throw in results names (similar to named groups in a regex), and accessing the individual fields of interest is simple.

html = """<some leading html>
<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m</font> &nbsp; 
<font class="info"><a href="/nmc/eng.php" class="usg">English</a>, <a href="/nmc/jew.php" class="usg">
Hebrew</a></font><br />Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>
<some trailing html>"""

from pyparsing import makeHTMLTags, SkipTo, Optional

pTag,pEnd = makeHTMLTags("P")
bTag,bEnd = makeHTMLTags("B")
aTag,aEnd = makeHTMLTags("A")
fontTag,fontEnd = makeHTMLTags("FONT")
brTag = makeHTMLTags("BR")[0]
nbsp = "&nbsp;"

nickEntry = (pTag + bTag + aTag + SkipTo(aEnd)("nickname") + aEnd + bEnd + Optional(nbsp) + 
            fontTag + SkipTo(fontEnd) + fontEnd + Optional(nbsp) +
            fontTag + aTag + SkipTo(aEnd) + aEnd + "," +
            aTag + SkipTo(aEnd) + aEnd + fontEnd + 
            brTag + "Short form of" +
            aTag + SkipTo(aEnd)("fullname") + aEnd)

for match,_,_ in nickEntry.scanString(html):
    print match.nickname, "->", match.fullname

prints:

ABE -> ABRAHAM
Paul McGuire