Before 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats them as plain text. The document I am parsing has HTML inside the textarea tags, and I am trying to process it.
I've tried:
for textarea in soup.findAll('textarea'):
    contents = BeautifulSoup.BeautifulSoup(textarea.contents)
    textarea.replaceWith(contents.html(text=True))
But I'm getting errors. I can't find this in the documentation, and the alternative parsers aren't helping. Anyone know how I can parse the textareas as HTML?
Edit:
Sample HTML is:
<textarea class="ks-lazyload-custom">
<div class="product-view product-view-rug">
Foobar Womble
<div class="product-view-head">
<img src="tps/i1/fo-25.gif" />
</div>
</div>
</textarea>
Error is:
File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer
I'm looking for a way of taking an element, extracting its contents, parsing them with BeautifulSoup, collapsing the result to text, and then replacing the contents of the original element (or replacing the whole element) with that text.
As for real world vs specs, that isn't particularly relevant here. The data needs to be parsed, and I'm looking for a way to do so.
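To make the "parse the contents and collapse to text" step concrete, here is a rough sketch of the result I'm after. It is not BeautifulSoup: it uses Python 3's standard-library html.parser as a stand-in, and the names TextCollector and collapse_to_text are made up for illustration. Note that the parser is fed a single string here, not a list of nodes.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps only the text nodes of the markup it is fed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def collapse_to_text(markup):
    """Parse an HTML fragment given as a string and return its text,
    with runs of whitespace collapsed to single spaces."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.chunks).split())


# Inner markup of the sample <textarea>, as one string (assumption: it has
# already been extracted from the element).
inner = """
<div class="product-view product-view-rug">
Foobar Womble
<div class="product-view-head">
<img src="tps/i1/fo-25.gif" />
</div>
</div>
"""

text = collapse_to_text(inner)
print(text)
```

For the sample above this leaves just the text "Foobar Womble". The TypeError in my attempt may well be related: textarea.contents is a list of nodes, while the parser expects a string or buffer.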