ansaurus

Question

How to prevent BeautifulSoup from stripping lines

Answer 1

A:

I found a solution:

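from BeautifulSoup import BeautifulSoup  # assuming BeautifulSoup 3

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
# Join the text nodes with a single space between them
txt = para.getText(separator=' ')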

But it's not optimal, because it puts a space between every pair of text nodes, even where the markup had none:

u'Available in French and English .  '

Notice the space before the dot.
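
The extra space seems to come from the joining step: getText appears to strip each text node and then put the separator between every pair of neighbours, including the node that holds only the punctuation. Here is a minimal sketch of that behaviour in plain Python. The list of strings is an assumption standing in for the stripped text nodes, not the actual markup from the question:

```python
# Hypothetical stripped text nodes, as they might come from markup like
# "Available in <b>French</b> and <b>English</b>."
text_nodes = [u'Available in', u'French', u'and', u'English', u'.']

# The separator goes between every pair of nodes,
# so one lands between 'English' and '.'.
joined = u' '.join(text_nodes)
print(joined)
```

With an empty separator the words would run together instead, which is why the separator alone cannot fix this.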

Oli 2010-06-07 09:16:07

Answer 2

A:

I finally got a good solution:

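import re
from BeautifulSoup import BeautifulSoup  # assuming BeautifulSoup 3

def clean_line(line):
    # Remove line breaks, then collapse runs of spaces into one
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
# findAll(text=True) yields the text nodes without stripping them
''.join([clean_line(line) for line in para.findAll(text=True)])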

This outputs:

u'Available in French and English.  '
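
The cleaning step can be tried without BeautifulSoup. The sketch below applies clean_line to a hand-written list of unstripped text nodes. The sample strings are assumptions standing in for what para.findAll(text=True) might return, not the real page content:

```python
import re

def clean_line(line):
    # Drop CR/LF characters, then collapse runs of two or more spaces.
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))

# Hypothetical unstripped text nodes (surrounding whitespace preserved).
text_nodes = [u'\nAvailable in ', u'French', u' and ', u'English', u'.  \n']

# Clean each node on its own, then concatenate with no separator.
result = u''.join(clean_line(node) for node in text_nodes)
print(result)
```

Each node is cleaned separately, so spaces that meet at a node boundary are not collapsed. That may be why the output above still ends in two spaces. Running clean_line once more on the joined string, or calling strip() on it, would tidy that up.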

Oli 2010-06-07 09:30:06
