I use TinyMCE to allow minimal formatting of text within my site. From the HTML that's produced, I'd like to convert it to plain text for e-mail. I've been using a class called html2text, but it's really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain text formatting — like putting underscores around text that previously had <i> tags in the HTML.

Does anyone use a similar approach to converting HTML to plain text in PHP? And if so: Do you recommend any third-party classes that I can use? Or how do you best tackle this issue?

Thanks!

+3  A: 

There's the trusty strip_tags function. It's not pretty, though: it only removes tags and does no formatting of its own. You could combine it with a string replace to get your fancy underscores.


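<?php
// Sample input; in practice this would be the HTML produced by TinyMCE.
$text = '<p>Some <i>italic</i> text and a <a href="http://example.com/">link</a>.</p>';

// To strip all tags and wrap italics with underscores:
$plain = strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// To preserve anchors, pass them as allowed tags. (Swapping "<a" for a
// placeholder and back would lose the closing </a> and also match <abbr>.)
$withLinks = strip_tags($text, "<a>");

?>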
Pestilence
Don't forget that strip_tags also removes anchors!
Alix Axel
A: 

html2text seems reasonable. I believe you can hack it to get your encoding working. I don't know which encodings html2text handles, but if it understands Latin-1 (ISO-8859-1), you could utf8_decode() your string first (so you get a Latin-1 string), run it through html2text, and then convert the result back to UTF-8 with utf8_encode().
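
A minimal sketch of that round trip, under a few assumptions: latin1_only_html_to_text() is a hypothetical stand-in for a converter that only understands ISO-8859-1 (it is not html2text), and mb_convert_encoding() from the mbstring extension is used for the conversion because utf8_decode() and utf8_encode() are deprecated as of PHP 8.2.

```php
<?php
// Hypothetical stand-in for a converter that only handles ISO-8859-1 input.
function latin1_only_html_to_text(string $html): string
{
    return html_entity_decode(strip_tags($html), ENT_QUOTES, 'ISO-8859-1');
}

// Convert UTF-8 HTML to Latin-1, run the converter, then convert back to UTF-8.
function utf8_html_to_text_via_latin1(string $utf8Html): string
{
    $latin1Html = mb_convert_encoding($utf8Html, 'ISO-8859-1', 'UTF-8');
    $latin1Text = latin1_only_html_to_text($latin1Html);

    return mb_convert_encoding($latin1Text, 'UTF-8', 'ISO-8859-1');
}

$text = utf8_html_to_text_via_latin1('<p>Café <b>crème</b></p>');
```

Characters with no Latin-1 equivalent (most non-Western scripts, for example) cannot survive this round trip, so it only helps when the content stays within Latin-1.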

Savageman
I'd much rather use a method that supports unicode natively than convert between encodings.
Justin Stayton
+1  A: 

I was also looking for a PHP converter so that I could automatically create a good-looking text equivalent of my HTML newsletter. I tried several, and looked at the class you mentioned. Most of them relied on preg_replace with regular expressions.

But the one I found suitable was version 2 of an html2text PHP script that uses a different technique: a DOMDocument. It worked great for me and that's what I'm now using. It does require PHP 5. I would recommend you try it.

With regard to your UTF-8 concerns, the writeup on the page I mention talks about it. Specifically it states:

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

The author then goes on to give you several approaches to solving this, and then says that his version 2 script using the DOMDocument can output utf-8, whereas his version 1 could not.
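
For illustration only, here is a small sketch of the encoding side of a DOMDocument-based approach. It is not the code of the script mentioned above. The helper name utf8_html_text_content() is made up, and the XML-declaration prefix is a common workaround that assumes loadHTML() would otherwise guess a single-byte encoding for input that declares no charset.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML with DOMDocument and return the body text.
function utf8_html_text_content(string $utf8Html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // Assumption: the prefix tells libxml to read the bytes as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $utf8Html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    // DOM strings are UTF-8, so the result is UTF-8 as well.
    return $body === null ? '' : $body->textContent;
}

$text = utf8_html_text_content('<p>Café <i>crème</i> costs 3 €</p>');
```

textContent only concatenates the text nodes, so tags such as <i> are dropped without any plain-text formatting.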

lkessler
+2  A: 

Since html2text is under a non-commercial license, and lkessler's link also can't be used for commercial sites, here is another html2text script (source) that is licensed under the Eclipse Public License (which can be used commercially).

It uses PHP's DOM methods to load from HTML, and then iterates over the resulting DOM to correctly output plain text. Example output: HTML to text. It can be used like so:

$text = convert_html_to_text($html);

It's not complete yet but it's open source and contributions are welcome.
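
To give a rough idea of what iterating over the DOM looks like, here is a minimal sketch. It is not the linked script: node_to_text() and sketch_html_to_text() are hypothetical names, the tag-to-text mapping is an arbitrary choice, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: recursively turn a DOM node into plain text.
function node_to_text(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse whitespace runs, roughly as a browser does when rendering.
        return preg_replace('/\s+/u', ' ', $node->nodeValue) ?? '';
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= node_to_text($child);
    }

    switch (strtolower($node->nodeName)) {
        case 'script':
        case 'style':
            return '';                      // never show code or CSS
        case 'i':
        case 'em':
            return '_' . $inner . '_';      // italics become _underscores_
        case 'b':
        case 'strong':
            return '*' . $inner . '*';
        case 'br':
            return "\n";
        case 'p':
        case 'div':
            return trim($inner) . "\n\n";   // blank line after block elements
        case 'a':
            $href = $node instanceof DOMElement ? $node->getAttribute('href') : '';
            return $href !== '' ? "$inner [$href]" : $inner;
        default:
            return $inner;
    }
}

// Hypothetical entry point: load the HTML and convert its body.
function sketch_html_to_text(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Assumption: this prefix makes libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    return $body === null ? '' : trim(node_to_text($body));
}

$text = sketch_html_to_text(
    '<p>Hello <i>world</i>,<br>see <a href="http://example.com/">this</a>.</p><p>Bye</p>'
);
```

Working on the parsed tree rather than on the raw string means nested or sloppy markup is handled by the parser instead of by regular expressions.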

jevon