I want to get correctly delimited text out of BeautifulSoup, turning tags into whitespace if necessary. The problem is that newlines are collapsed and tags like <br/> are not rendered as whitespace.

<div class="companyInfo">
    <p class="identInfo">
        <acronym title="Standard Industrial Code">
            SIC
        </acronym>
        :
        <a href="/?SIC=3674">
            3674
        </a>
        - SEMICONDUCTORS &amp; RELATED DEVICES
        <br />
        State location: CA
    </p>
</div>

If I run BeautifulSoup(sampleHTML).text I get the following:

u'SIC:3674- SEMICONDUCTORS &amp; RELATED DEVICESState location: CA'

I would like to get something that treats the whitespace correctly, like this:

u'SIC : 3674 - SEMICONDUCTORS &amp; RELATED DEVICES State location: CA'

Any suggestions? Thanks!

A: 

I ended up using the contents attribute to get at the information I wanted from the individual nodes. This turned out to be better than using the text attribute, because it removed the need for some of the text parsing.

So, in conclusion, use the contents attribute, or follow the link that Jouni left and check out the answers there.
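
For anyone landing here later, a minimal sketch of that idea, assuming BeautifulSoup 4 (the bs4 package) and the built-in html.parser backend; the question may well have been written against an older version. The variable names (sample_html, pieces, joined, spaced) are made up for illustration. It walks the direct children of the <p> element through contents, and also shows get_text with a separator, which gives the same whitespace-delimited result in one call.

```python
from bs4 import BeautifulSoup, NavigableString

sample_html = """<div class="companyInfo">
    <p class="identInfo">
        <acronym title="Standard Industrial Code">
            SIC
        </acronym>
        :
        <a href="/?SIC=3674">
            3674
        </a>
        - SEMICONDUCTORS &amp; RELATED DEVICES
        <br />
        State location: CA
    </p>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")
p = soup.find("p", class_="identInfo")

# Option 1: walk the direct children of <p> via .contents.
# Each child is either a bare string or a tag; empty pieces
# (pure whitespace, or tags with no text such as <br/>) are dropped.
pieces = []
for node in p.contents:
    if isinstance(node, NavigableString):
        text = node.strip()
    else:
        text = node.get_text(strip=True)
    if text:
        pieces.append(text)
joined = " ".join(pieces)

# Option 2: let get_text strip every string and join them with a space.
spaced = soup.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```

With the sample from the question, both lines print SIC : 3674 - SEMICONDUCTORS & RELATED DEVICES State location: CA (the &amp; entity is decoded to a plain & by the parser). Walking contents is more work, but it lets you treat each node separately, for example pulling the href out of the <a> tag instead of only its text.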

J. Frankenstein