views:

278

answers:

2

Before 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats it as text. The document I am parsing has HTML inside the textarea tags, and I am trying to process it.

I've tried:

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents)
        textarea.replaceWith(contents.html(text=True))

But I'm getting errors. I can't find this in the documentation, and the alternative parsers aren't helping. Anyone know how I can parse the textareas as HTML?

Edit:

Sample HTML is:

<textarea class="ks-lazyload-custom">
  <div class="product-view product-view-rug">
    Foobar Womble
    <div class="product-view-head">
      <img src="tps/i1/fo-25.gif" />
    </div>
  </div>
</textarea>

Error is:

File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, 
in _detectEncoding '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer

I'm looking for a way of taking an element, extracting the contents, parsing them with BeautifulSoup, collapsing it to text, and then replacing the contents of the original element (or replacing the whole element) with that text.

As for real world vs specs, it actually isn't particularly relevant here. The data needs to be parsed, I'm looking for the way to do so.

A: 

I'm now using the following code which mostly works. Your milage may vary.

def _extractText(self, data, encoding):
    if self.isDebug: self._output("_extractText")
    soup = BeautifulSoup.BeautifulSoup(data, fromEncoding=encoding)
    comments = soup.findAll(text=lambda text:isinstance(text, BeautifulSoup.Comment))
    [comment.extract() for comment in comments]
    [script.extract() for script in soup.findAll('script')]
    [css.extract() for css in soup.findAll('style')]
    for textarea in soup.findAll('textarea'):
        textarea.string = self._extractText(textarea.renderContents(), 'UTF-8')
    text = unicode('')
    for line in soup.findAll(text=True):
        line = line.replace('&nbsp;', ' ').strip()  
        if line == '': continue
        if line.startswith('doctype'): continue
        if line.startswith('DOCTYPE'): continue
        text = text + line + '\n'
    return text
brofield
+1  A: 

This seems to work fairly well (if I correctly understood what you wanted):

for textarea in soup.findAll('textarea'):
    contents = BeautifulSoup.BeautifulSoup(textarea.contents[0]).renderContents()
    textarea.replaceWith(contents)
Justin Peel
Thanks, that does seem to do what I am after.
brofield