views:

453

answers:

2

This is a soup from a WordPress post detail page:

import re

# soup is the parsed page; item is the dict-like object being filled in
content = soup.body.find('div', id=re.compile('post'))
title = content.h2.extract()
item['title'] = unicode(title.string)
item['content'] = u''.join(map(unicode, content.contents))

I want to omit the enclosing div tag when assigning item['content']. Is there any way to render all the child tags of a tag in unicode? Something like:

item['content'] = content.contents.__unicode__()

that will give me a single unicode string instead of a list.

+4  A: 

Have you tried:

unicode(content)

It converts content's markup to a single Unicode string.

Edit: If you don't want the enclosing tag, try:

content.renderContents()
Ayman Hourieh
Yes, but I want to omit the enclosing tag. Outer DIV in this case.
muhuk
I checked the docs and the source code, but I couldn't find a method that returns the contents as a single Unicode object. renderContents() returns a string encoded in UTF-8. So converting the result of renderContents() to a Unicode object is the best approach I can find.
Ayman Hourieh
Yes, unicode/renderContents combo certainly beats my map/unicode combo.
muhuk
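For anyone reading this on a newer stack, here is a minimal sketch of the same idea. It assumes BeautifulSoup 4 on Python 3, which is not what the thread uses, and the HTML string is a made-up stand-in for a WordPress post page. In that version, encode_contents() returns the children as UTF-8 bytes (the role renderContents() plays above) and decode_contents() returns them directly as a text string, so the enclosing div is left out either way.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BS3 used in the thread

# Hypothetical stand-in for a WordPress post detail page.
html = u'<body><div id="post-1"><h2>Title</h2><p>caf\xe9</p><p>More</p></div></body>'
soup = BeautifulSoup(html, 'html.parser')

content = soup.body.find('div', id=re.compile('post'))
title = content.h2.extract()  # removes the <h2> from content and returns it

item = {}
item['title'] = title.get_text()

# Children rendered as UTF-8 bytes, without the enclosing <div>.
inner_bytes = content.encode_contents()

# Decode the bytes to get a single text string.
item['content'] = inner_bytes.decode('utf-8')

# decode_contents() skips the bytes step and returns text directly.
inner_text = content.decode_contents()
```

Both item['content'] and inner_text hold the same markup here, so the bytes round trip is only needed if you start from the encoded form.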
A: 

Best method I can come up with:

unicode(content.renderContents(), 'utf-8')
muhuk