views: 270

answers: 3

How can I make HTML from an email safe to display in a web browser with Python?

External references shouldn't be followed when the email is displayed. In other words, all displayed content should come from the email itself and nothing from the internet.

Other than spam, emails should be displayed as closely as possible to the way the writer intended.

I would like to avoid coding this myself.

Solutions requiring the latest browser version (Firefox) are also acceptable.

A: 

Use the HTMLParser module, or install Beautiful Soup, and use either one to parse the HTML and disable or remove the offending tags. This will leave whatever link text was there, but it will not be highlighted and it will not be clickable, since you are displaying it with a web browser component.

You could make it clearer what was done by replacing the <A></A> with a <SPAN></SPAN> and changing the text decoration to show where the link used to be: maybe a different shade of blue than normal and a dashed underline to indicate brokenness. That way you are a little closer to displaying the message as intended without actually misleading people into clicking on something that is not clickable. You could even add a hover in JavaScript or pure CSS that pops up a tooltip explaining that links have been disabled for security reasons.

Similar things could be done with <IMG> tags, including replacing them with a blank rectangle to ensure that the page layout stays close to the original.

I've done stuff like this with Beautiful Soup, but HTMLParser is included with Python. Older Python distributions also shipped an htmllib module, which is now deprecated. Since the HTML in an email message might not be fully correct, use Beautiful Soup 3.0.7a, which is better at making sense of broken HTML.
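
For illustration, both replacements might look roughly like this with the Beautiful Soup 3 API mentioned above. This is only a sketch: the defang helper, the styles, and the placeholder sizes are invented for the example, not taken from any existing code.

    # Rough sketch of the approach above, using the Beautiful Soup 3 API
    # (module and method names differ in the later bs4 package).
    from BeautifulSoup import BeautifulSoup, Tag, NavigableString

    def defang(html):
        soup = BeautifulSoup(html)

        # Replace every <a> with a styled <span> holding the original link
        # text, so nothing is clickable but the text stays visible.
        for a in soup.findAll('a'):
            span = Tag(soup, 'span')
            span['style'] = 'color: #6699cc; border-bottom: 1px dashed #6699cc;'
            span['title'] = 'Link disabled for security reasons'
            span.insert(0, NavigableString(''.join(a.findAll(text=True))))
            a.replaceWith(span)

        # Replace every <img> with an empty box so nothing is fetched from
        # the network but the layout stays roughly intact.
        for img in soup.findAll('img'):
            box = Tag(soup, 'span')
            box['style'] = ('display: inline-block; width: %spx; height: %spx; '
                            'border: 1px solid #ccc;') % (img.get('width', '100'),
                                                          img.get('height', '20'))
            img.replaceWith(box)

        return str(soup)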

Michael Dillon
Clickable links are not a problem. Images and other references normally fetched automatically are.
iny
A proper HTML parser is indeed a good start. But be sure to work with a white-list of acceptable tags and their acceptable attributes, and remove everything else. A black-list approach is likely to be easy to get around: there are many more potentially dangerous or external-content-including tags than you think, especially given cross-browser differences. Also, if you need to allow styles, you have got yourself a difficult CSS-parsing task ahead of you to allow only known-good properties.
bobince
That is why I would prefer an existing solution instead of doing it myself.
iny
+1  A: 

html5lib contains an HTML+CSS sanitizer. It allows too much currently, but it shouldn't be too hard to modify it to match the use case.

Found it from here.
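
As a rough sketch of how that sanitizer was wired up in the html5lib releases of that time (the module layout has since changed, and the whitelists would still need trimming to rule out external references, as noted above):

    # Sketch only: assumes the sanitizer API of older html5lib releases;
    # later versions moved or removed this module.
    import html5lib
    from html5lib import sanitizer, treebuilders

    def sanitize(dirty_html):
        # The sanitizing tokenizer keeps only whitelisted tags and attributes
        # and escapes everything else. To forbid external content such as
        # <img src=...>, subclass HTMLSanitizer and shrink its whitelists.
        parser = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                                     tree=treebuilders.getTreeBuilder("dom"))
        fragment = parser.parseFragment(dirty_html)
        return fragment.toxml()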

iny
+1  A: 

I'm not quite clear on what exactly you mean by "safe". It's a pretty big topic... but, for what it's worth:

In my opinion, the stripping parser from the ActiveState Cookbook is one of the easiest solutions. You can pretty much copy/paste the class and start using it.
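
As a rough illustration of the whitelist idea behind that kind of stripping parser (this is not the Cookbook code itself, and the allowed-tag list is just an example), something along these lines can be built on the HTMLParser module that ships with Python:

    # Illustration of the whitelist idea, not the ActiveState recipe itself.
    # HTMLParser ships with Python 2 (it is html.parser in Python 3).
    from HTMLParser import HTMLParser
    from cgi import escape

    class StrippingParser(HTMLParser):
        # Example whitelist: anything not listed here is silently dropped.
        allowed_tags = ('p', 'br', 'b', 'i', 'em', 'strong', 'span', 'div')

        def __init__(self):
            HTMLParser.__init__(self)
            self.result = []

        def handle_starttag(self, tag, attrs):
            if tag in self.allowed_tags:
                # Drop all attributes; href/src/style are where the
                # external references and scripts hide.
                self.result.append('<%s>' % tag)

        def handle_endtag(self, tag):
            if tag in self.allowed_tags:
                self.result.append('</%s>' % tag)

        def handle_data(self, data):
            self.result.append(escape(data))

    def strip_html(html):
        parser = StrippingParser()
        parser.feed(html)
        parser.close()
        return ''.join(parser.result)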

Have a look at the comments as well. The last one states that it doesn't work any more, but I also have this running in an application somewhere and it works fine. I don't have access to that box from work, so I'll have to look it up over the weekend.

exhuma
Just confirming that the script indeed does not leave valid tags any more, as the commenter stated on that page.
ropable