I want to get correctly delimited text out of BeautifulSoup, turning tags into whitespace where necessary. The problem is that newlines are collapsed and tags like <br/> are not rendered as whitespace. For example, given this HTML:
<div class="companyInfo">
<p class="identInfo">
<acronym title="Standard Industrial Code">
SIC
</acronym>
:
<a href="/?SIC=3674">
3674
</a>
- SEMICONDUCTORS & RELATED DEVICES
<br />
State location: CA
</p>
</div>
If I run BeautifulSoup(sampleHTML).text I get the following:
u'SIC:3674- SEMICONDUCTORS & RELATED DEVICESState location: CA'
I would like to get something that treats the whitespace correctly, like this:
u'SIC : 3674 - SEMICONDUCTORS & RELATED DEVICES State location: CA'
Any suggestions? Thanks!
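One possible approach, sketched under the assumption that BeautifulSoup 4 (the bs4 package) is in use: get_text() accepts a separator that is placed between adjacent text nodes, so a tag such as <br /> ends up as a space. The helper name text_with_spaces and the choice of the built-in "html.parser" backend are illustrative, not part of the original question, and the runs of whitespace are collapsed afterwards with split()/join.

```python
from bs4 import BeautifulSoup

# The sample from the question; the ampersand is escaped as &amp; here
# so the markup is well-formed regardless of the parser used.
SAMPLE_HTML = """<div class="companyInfo">
<p class="identInfo">
<acronym title="Standard Industrial Code">
SIC
</acronym>
:
<a href="/?SIC=3674">
3674
</a>
- SEMICONDUCTORS &amp; RELATED DEVICES
<br />
State location: CA
</p>
</div>"""


def text_with_spaces(html):
    """Return the text of `html` with tag boundaries rendered as single spaces.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " inserts a space between adjacent text nodes, so tags
    # (including <br />) no longer glue neighbouring text together.
    spaced = soup.get_text(separator=" ")
    # Collapse the remaining newlines and repeated spaces into single spaces.
    return " ".join(spaced.split())


print(text_with_spaces(SAMPLE_HTML))
```

Note that this treats every tag boundary as a space, which may be too aggressive for inline tags that split a word (for example <b>un</b>believable), so it is probably only a starting point.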