ansaurus

Question

How to use BeautifulSoup to extract from within an HTML paragraph?

Answer 1

The following should work (it is written for BeautifulSoup 3 under Python 2):

htm = '''<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m
</font>&nbsp; <font class="info"><a href="/nmc/eng.php" class="usg">English
</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />
Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>'''

import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(htm)

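for p in soup.findAll('p'):
  # firsta: still waiting for the first <a> in this paragraph
  # shortf: the text 'Short form of' has just been seen
  firsta = True
  shortf = False
  # walk every descendant of <p> (tags and text nodes) in document order
  for c in p.recursiveChildGenerator():
    if isinstance(c, BeautifulSoup.NavigableString):
      if 'Short form of' in str(c):
        shortf = True
    elif c.name == 'a':
      if firsta or shortf:
        # print the link text, then ignore links until the next trigger
        print c.renderContents()
        firsta = shortf = False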
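
The two-flag logic in the loop above (take the first link of each paragraph, then the first link after the text "Short form of") does not depend on BeautifulSoup itself. As a rough sketch, and not the code from this answer, the same idea can be tried in Python 3 with only the standard-library html.parser module. The class name NameLinkParser and its attribute names are made up for this illustration.

```python
from html.parser import HTMLParser


class NameLinkParser(HTMLParser):
    """Hypothetical helper: collects the text of the first <a> in each <p>
    and of the first <a> that follows the text 'Short form of'."""

    def __init__(self):
        super().__init__()
        self.names = []          # collected link texts, in document order
        self.first_a = False     # True while waiting for the first <a> of a <p>
        self.short_form = False  # True once 'Short form of' has been seen
        self.capturing = False   # True while inside a link we want
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.first_a = True
            self.short_form = False
        elif tag == 'a' and (self.first_a or self.short_form):
            self.capturing = True
            self.buffer = []
            self.first_a = self.short_form = False

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)
        elif 'Short form of' in data:
            self.short_form = True

    def handle_endtag(self, tag):
        if tag == 'a' and self.capturing:
            self.names.append(''.join(self.buffer).strip())
            self.capturing = False


# Same sample markup as in the answer above.
htm = '''<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m
</font>&nbsp; <font class="info"><a href="/nmc/eng.php" class="usg">English
</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a></font><br />
Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>'''

parser = NameLinkParser()
parser.feed(htm)
parser.close()
print(parser.names)
```

The "English" and "Hebrew" links are skipped because neither flag is set when they open, which mirrors the firsta/shortf checks in the BeautifulSoup version.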

Alex Martelli 2010-07-03 00:57:55

Thank you Alex, that worked with a few modifications.

2010-07-04 19:59:41

@kartiku, you're welcome, and, glad to hear this!

Alex Martelli 2010-07-04 20:03:42

Answer 2

You can use pyparsing as a sort of "super-regex" for parsing HTML. You assemble a simple matching pattern from the various opening and closing tags, without tripping over the typical pitfalls of scraping HTML with regexes (unpredictable tag and attribute letter case, unpredictable attributes, attributes out of order, unpredictable whitespace). pattern.scanString then returns a generator that scans through the HTML source and yields a tuple for each match: the matched tokens, the start location, and the end location. Add results names (similar to named groups in a regex), and accessing the individual fields of interest is simple.

html = """<some leading html>
<p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m</font> &nbsp; 
<font class="info"><a href="/nmc/eng.php" class="usg">English</a>, <a href="/nmc/jew.php" class="usg">
Hebrew</a></font><br />Short form of <a href="/name/abraham" class="nl">ABRAHAM</a>
<some trailing html>"""

from pyparsing import makeHTMLTags, SkipTo, Optional

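# makeHTMLTags returns (opening, closing) tag expressions that tolerate
# any letter case, extra attributes, and whitespace inside the tag
pTag,pEnd = makeHTMLTags("P")
bTag,bEnd = makeHTMLTags("B")
aTag,aEnd = makeHTMLTags("A")
fontTag,fontEnd = makeHTMLTags("FONT")
brTag = makeHTMLTags("BR")[0]
nbsp = "&nbsp;"

# the whole entry, tag by tag; the results names "nickname" and "fullname"
# mark the two fields to extract
nickEntry = (pTag + bTag + aTag + SkipTo(aEnd)("nickname") + aEnd + bEnd + Optional(nbsp) +
            fontTag + SkipTo(fontEnd) + fontEnd + Optional(nbsp) +
            fontTag + aTag + SkipTo(aEnd) + aEnd + "," +
            aTag + SkipTo(aEnd) + aEnd + fontEnd +
            brTag + "Short form of" +
            aTag + SkipTo(aEnd)("fullname") + aEnd)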

for match,_,_ in nickEntry.scanString(html):
    print match.nickname, "->", match.fullname

prints:

ABE -> ABRAHAM
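
To see why the regex pitfalls listed above matter, here is a small standard-library sketch in Python 3. It is only an illustration, not part of the pyparsing solution: the sample string is a hypothetical variant of the markup in which the second link uses an upper-case tag and puts class before href, and the two patterns are made up for the comparison.

```python
import re

# Hypothetical variant of the sample markup: the second link uses an
# upper-case tag name and lists its attributes in a different order.
sample = ('<p><b><a href="/name/abe">ABE</a></b><br />'
          'Short form of <A class="nl" HREF="/name/abraham">ABRAHAM</A>')

# Naive pattern: assumes a lower-case tag with href as its only attribute.
naive = re.compile(r'<a href="[^"]*">(?P<name>[^<]*)</a>')

# More tolerant pattern: any letter case, any attributes in any order.
tolerant = re.compile(r'<a\b[^>]*>(?P<name>[^<]*)</a\s*>', re.IGNORECASE)

# Named groups play the role that results names play in pyparsing.
naive_names = [m.group('name') for m in naive.finditer(sample)]
tolerant_names = [m.group('name') for m in tolerant.finditer(sample)]

print(naive_names)
print(tolerant_names)
```

The naive pattern misses the second link entirely, while the tolerant one finds both. Hand-writing that tolerance for every tag in a pattern is the bookkeeping that makeHTMLTags takes care of.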

Paul McGuire 2010-07-03 02:47:53
