ansaurus

Question

Cannot prettify scraped html in BeautifulSoup

Answer 1

soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))

Jonathan Feinberg 2010-01-07 16:59:04

I added your line, and it now raises an error inside BeautifulSoup.py: TypeError: expected string or buffer

Kevin 2010-01-07 17:07:16

Answer 2

This works for me:

soup1 = BeautifulSoup(''.join(str(t) for t in tags))

This pyparsing solution gives some decent output, too:

from pyparsing import makeHTMLTags, originalTextFor, SkipTo

# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")

# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)

# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))

# extract links from the input html
links = aLink.searchString(html)

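# build list of strings for output; pre and post are the strings that
# wrap the links (the first and last lines of the output)
out = []
out.append(pre)
out.extend(['  '+lnk[0] for lnk in links])
out.append(post)

# the parenthesized form works as a print statement or a print() call
print('\n'.join(out))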

prints:

<html><head><title>Page title</title></head>
  <a href="http://www.reddit.com/r/pics/" >pics</a>
  <a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
  <a href="http://www.reddit.com/r/politics/" >politics</a>
  <a href="http://www.reddit.com/r/funny/" >funny</a>
  <a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
  <a href="http://www.reddit.com/r/WTF/" >WTF</a>
  .
  .
  .
  <a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
  <a href="#" onclick="return hidecover(this)">close this window</a>
  <a href="http://www.reddit.com/feedback" >volunteer to translate</a>
  <a href="#" onclick="return hidecover(this)">close this window</a>
</html>
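
The snippet above relies on html, pre and post already being defined. As a rough self-contained sketch of the same pyparsing approach, the version below supplies its own values. The sample html string and the extract_links helper are made up for illustration and are not taken from the question; pre and post are copied from the first and last lines of the output shown above. It assumes pyparsing is installed and uses the same camelCase names as the code above.

```python
from pyparsing import makeHTMLTags, originalTextFor, SkipTo

# Hypothetical sample input (an assumption, not the page scraped in the question).
html = (
    '<div><a href="http://example.com/a">first</a> some text '
    '<a href="http://example.com/b">second</a></div>'
)

# Wrapper strings, copied from the first and last lines of the output above.
pre = '<html><head><title>Page title</title></head>'
post = '</html>'


def extract_links(source):
    """Return the text of every <a>...</a> element found in source."""
    a_tag, a_end = makeHTMLTags("A")
    # originalTextFor keeps the raw tag text instead of the parsed attributes.
    a_link = originalTextFor(a_tag) + SkipTo(a_end) + originalTextFor(a_end)
    # Join the open tag, the link text and the close tag into one string.
    a_link.setParseAction(lambda tokens: ''.join(tokens))
    return [match[0] for match in a_link.searchString(source)]


links = extract_links(html)
page = '\n'.join([pre] + ['  ' + link for link in links] + [post])
print(page)
```

Wrapping the grammar in a function keeps the tag expressions local, so the same helper can be reused on other page sources without redefining anything.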

Paul McGuire 2010-01-07 23:55:27
