ansaurus

Question

How can I grab CData out of BeautifulSoup?

Answer 1

A:

You could try this:

from BeautifulSoup import BeautifulSoup

# source.html contains the HTML from your question
with open('source.html') as f:
    soup = BeautifulSoup(f.read())

# Take the first child of the first <script> tag
s = soup.findAll('script')
cdata = s[0].contents[0]

That should give you the contents of the CDATA section.

Update

This may be a little cleaner:

from BeautifulSoup import BeautifulSoup
import re

# source.html contains the HTML from your question
with open('source.html') as f:
    soup = BeautifulSoup(f.read())

# Find the first text node whose content matches "CDATA"
cdata = soup.find(text=re.compile("CDATA"))

Just personal preference, but I like the bottom one a little better.
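Depending on how the page embeds it, the string returned by either snippet may still carry the `<![CDATA[` and `]]>` markers (scripts often hide them behind `//`). Here is a minimal sketch of stripping them with the standard `re` module. The names `CDATA_RE` and `strip_cdata_markers` are hypothetical and not part of BeautifulSoup, and the pattern only looks at the first CDATA section in the string.

```python
import re

# Matches one CDATA section, tolerating the "//" that scripts often put
# in front of the opening and closing markers.
CDATA_RE = re.compile(r"(?://\s*)?<!\[CDATA\[(.*?)(?://\s*)?\]\]>", re.DOTALL)


def strip_cdata_markers(text):
    """Return the payload of the first CDATA section in `text`.

    If `text` contains no CDATA section, it is returned unchanged.
    """
    match = CDATA_RE.search(text)
    if match is None:
        return text
    # Drop the whitespace left over next to the markers.
    return match.group(1).strip()
```

You would call it on the result of the lookup, for example `strip_cdata_markers(cdata)`.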

RJ Regenold 2010-01-09 03:19:21

thanks for the response, this website is a vast wealth of knowledge

hary wilke 2010-01-09 12:50:24

Answer 2

+1 A:

BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:

import BeautifulSoup

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

soup = BeautifulSoup.BeautifulSoup(txt)
# Walk every text node and keep only the CData instances
for cd in soup.findAll(text=True):
    if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd

In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
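If you want to try the "CDATA is a special kind of text node" idea without BeautifulSoup installed, here is a rough sketch that swaps in the standard library's `xml.dom.minidom`. This is not BeautifulSoup's API, and it only works on well-formed XML such as the sample above, not on arbitrary HTML. In minidom, `CDATASection` is likewise a subclass of the ordinary `Text` node. The helper name `iter_cdata` is hypothetical.

```python
from xml.dom import minidom
from xml.dom.minidom import CDATASection

TXT = """<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>"""


def iter_cdata(node):
    """Yield the text of every CDATA section below `node`, depth first."""
    for child in node.childNodes:
        if isinstance(child, CDATASection):
            # The markers are already gone; .data is just the payload.
            yield child.data
        else:
            # Plain text nodes have no children, elements are searched recursively.
            for found in iter_cdata(child):
                yield found


doc = minidom.parseString(TXT)
cdata_sections = list(iter_cdata(doc.documentElement))
print(cdata_sections)
```

To restrict the search to one subtree, pass that element to `iter_cdata` instead of `doc.documentElement`.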

Alex Martelli 2010-01-09 03:31:41

Thanks, this will do nicely. It even cleaned off the starting and ending <![CDATA //]]> bits. I had tried BeautifulSoup.CData before, but it didn't work for me; I was getting the following error: "AttributeError: class BeautifulSoup has no attribute 'CData'". I guess I needed "import BeautifulSoup" instead of "from BeautifulSoup import BeautifulSoup".

hary wilke 2010-01-09 12:47:04

@hary, yes, this kind of thing is part of why I recommend always importing the module (`import BeautifulSoup`) rather than bits and pieces from within it!-)

Alex Martelli 2010-01-09 15:25:52
