I'm trying to convert an online HTML page to plain text.

I have a problem with this structure:

<div align="justify"><b>Available in  
<a href="http://www.example.com.be/book.php?number=1">
French</a> and 
<a href="http://www.example.com.be/book.php?number=5">
English</a>.
</div>

Here is its representation as a Python string:

'<div align="justify"><b>Available in  \r\n<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a> and \r\n<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n</div>'

When using:

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.text

BeautifulSoup converts it (in the 'txt' variable) to:

u'Available inFrenchandEnglish.'

It probably strips each line of the original HTML string, so the whitespace that separated the words from the links is lost.

Is there a clean solution to this problem?

Thanks.

A: 

I found a solution:

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.getText(separator=' ')

But it's not optimal, because it puts a space between every pair of adjacent text nodes:

u'Available in French and English .  '

Notice the space before the dot.

Oli
A: 

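I finally found a good solution:

import re

def clean_line(line):
    # Drop line breaks, then collapse runs of spaces into a single space.
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
# Clean each text node separately, then join them without a separator.
''.join([clean_line(line) for line in para.findAll(text=True)])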

Which outputs:

u'Available in French and English.  '
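
For comparison, here is a minimal sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is not the code above: the names TextCollector and html_to_text are made up for illustration, it targets Python 3, and it collapses every whitespace run (line breaks included) into one space rather than deleting the line breaks, which is roughly what a browser does when rendering.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each piece of text found between tags.
        self.chunks.append(data)


def html_to_text(html):
    """Join all text nodes, then collapse each whitespace run to one space."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return re.sub(r'\s+', ' ', ''.join(parser.chunks)).strip()


sample = (
    '<div align="justify"><b>Available in  \r\n'
    '<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a> and \r\n'
    '<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n'
    '</div>'
)
print(html_to_text(sample))  # Available in French and English.
```

Because the text nodes are joined before the whitespace is normalized, no space gets inserted before the dot, and words separated only by a line break still end up separated by a space.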
Oli