We are retrieving emails from our Gmail account using IMAP4_SSL in Python. The email body comes back as HTML, and we need to convert it to plain text. Can anyone help us with that?

+2  A: 

Stand on the shoulders of giants...
Peter Bengtsson has worked out a solution to this exact problem here.
Peter's script uses the awesome BeautifulSoup, by Leonard Richardson,
and Fredrik Lundh's unescape() function.

Using Peter's test case, you get this:

This is a paragraph.

Foobar [1]
http://two.com

Visit http://www.google.com.

Text elsewhere. Elsewhere [2]

[1] http://one.com
[2] http://three.com

...from this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<body>

<div id="main">
<p>This is a paragraph.</p>

<p><a href="http://one.com">Foobar</a>
<br />

<a href="http://two.com">two.com</a>

</p>
  <p>Visit <a href="http://www.google.com">www.google.com</a>.</p>
<br />
Text elsewhere.

<a href="http://three.com">Elsewhere</a>

</div>
</body>
</html>
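
If you just want to see the general idea without installing anything, here is a minimal sketch that uses only Python's standard library (html.parser and email). It is not Peter's script and it does not use BeautifulSoup. The names LinkFootnoteParser, html_to_text and html_body are made up for illustration, and the rule "print the URL itself when the link text already appears in it" is an assumption inferred from the sample output above. Whitespace in its output may differ from what Peter's script produces.

```python
import email
import re
from email import policy
from html.parser import HTMLParser


class LinkFootnoteParser(HTMLParser):
    """Collect visible text; turn <a href=...> links into numbered footnotes."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []     # pieces of the output text
        self.links = []      # footnote URLs, in order of appearance
        self.href = None     # href of the <a> currently open, if any
        self.link_text = []  # text collected inside the current <a>
        self.skip = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip += 1
        elif tag == "a":
            self.href = dict(attrs).get("href") or None
            self.link_text = []
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag == "a" and self.href is not None:
            text = " ".join("".join(self.link_text).split())
            if text and text in self.href:
                # Assumed rule: link text like "two.com" is replaced by the URL.
                self.chunks.append(self.href)
            else:
                self.links.append(self.href)
                self.chunks.append(f"{text} [{len(self.links)}]")
            self.href = None
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip:
            return
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.chunks.append(data)


def html_to_text(html):
    """Return a plain-text rendering of an HTML string."""
    parser = LinkFootnoteParser()
    parser.feed(html)
    parser.close()
    # Normalise spaces within each line, then collapse runs of blank lines.
    lines = [" ".join(line.split()) for line in "".join(parser.chunks).splitlines()]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    if parser.links:
        notes = "\n".join(f"[{i}] {url}" for i, url in enumerate(parser.links, 1))
        text = f"{text}\n\n{notes}"
    return text


def html_body(raw_bytes):
    """Return the HTML part of a raw RFC 822 message, or None if there is none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None
```

With imaplib, the bytes passed to html_body would typically be the message payload returned by a fetch of "(RFC822)"; the result (if not None) can then be handed to html_to_text.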
Adam Bernier