Hi,

I am able to read emails from Microsoft Exchange using an IMAP client from Lumisoft. I have set the Exchange server to convert all mail to plain text. However, when I read a message, the body still seems to contain HTML/CSS.

What is the best way of removing HTML/CSS from the body of an email? Or is there a setting on the Exchange server that I seem to have missed?

A: 

I'm not sure exactly how your setup works or whether you can run scripts on it, but an HTML parser is the most reliable way to pull the text out of HTML. For instance, with Hpricot (a Ruby HTML-parsing library), you could do puts doc.find_element('body').inner_text and that would print the text content of the document.
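
If Ruby is not available, the same parser-based idea can be sketched with Python's standard library instead of Hpricot. This is only an illustration, not the Hpricot API: the names TextExtractor and html_to_text are made up for the example, and the list of block-level tags is an assumption that real mail may need extended.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <style>/<script>."""

    SKIPPED = {"style", "script"}
    # Assumed set of tags that should separate words in the output.
    BLOCK = {"p", "div", "br", "tr", "td", "li", "h1", "h2", "h3", "table"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Text inside <style> or <script> (i.e. the CSS) is dropped here.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    sample = ("<html><head><style>p { color: red; }</style></head>"
              "<body><p>Hello &amp; welcome</p><p>Bye</p></body></html>")
    print(html_to_text(sample))
```

Because the parser tracks when it is inside a style element, the CSS rules are discarded along with the tags, which is the part a simple tag-stripping pass tends to leave behind.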

Chuck
Hi, this sounds like a solution I could use. How and where would I run a script like this?
James
The link for Hpricot is http://wiki.github.com/why/hpricot. You will need the Ruby programming language to run it: http://www.ruby-lang.org/en/.
toby
Hi, I have decided against this method as I don't really have a lot of experience with Ruby.
James
A: 

I usually take one of these approaches...

  1. Using regular expressions. It can be difficult to get right if the solution also has to cope with all kinds of invalid markup, but I bet someone else has done it before you (hint: Google it or search SO).

  2. Using an HTML parser library. You can find one for any popular programming language out there. I recommend using the Html Agility Pack.
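
As a rough sketch of the first approach, the regular expressions below remove style and script blocks (including the CSS inside them) before stripping the remaining tags. It is written in Python with only the standard library; strip_html is a hypothetical helper name, and the patterns are assumptions that will not survive every kind of broken markup.

```python
import html
import re

# <style>...</style> and <script>...</script>, contents included.
BLOCK_RE = re.compile(r"<(style|script)\b[^>]*>.*?</\1\s*>",
                      re.IGNORECASE | re.DOTALL)
# HTML comments, which mail clients often use for conditional CSS.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Any remaining tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(markup):
    text = BLOCK_RE.sub(" ", markup)    # drop CSS/JS blocks first
    text = COMMENT_RE.sub(" ", text)    # then comments
    text = TAG_RE.sub(" ", text)        # then the tags themselves
    text = html.unescape(text)          # &amp; -> &, &lt; -> <, ...
    return " ".join(text.split())       # collapse leftover whitespace


if __name__ == "__main__":
    body = ('<style type="text/css">p {color:red}</style>'
            '<p>Tom &amp; Jerry</p><!-- footer -->')
    print(strip_html(body))
```

Every tag is replaced by a space here, so a word split by inline markup (for example `<b>Hel</b>lo`) comes out as two words. A parser library such as the one in the second approach handles that case, and malformed markup, more gracefully.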

Hi, at the minute I am using a regular expression that I created myself, and it only strips out the HTML tags (which leaves the CSS). I don't feel 100% comfortable using this approach, though. Ideally I would like an Exchange server setting that definitively converts any mail received by a specific mailbox to plain text. I tried setting the IMAP settings for the mailbox to plain text only. It worked for a while and then suddenly stopped!
James
Decided to go with the HtmlAgilityPack library.
James