I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that will output plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt at W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout by setting ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
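A minimal sketch of the wrapper described in EDIT 2, not the code actually used. The class name `LynxConverter`, its method names, and the `lynxPath` parameter are hypothetical; it assumes lynx.exe is installed and that the input is a local HTML file or a URL that Lynx can open.

```csharp
using System;
using System.Diagnostics;

public static class LynxConverter
{
    // Builds the start info for "lynx -dump <input>".
    // lynxPath is an assumption: point it at wherever lynx.exe lives on your machine.
    public static ProcessStartInfo CreateStartInfo(string lynxPath, string htmlFileOrUrl)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + htmlFileOrUrl + "\"",
            UseShellExecute = false,        // required in order to redirect stdout
            RedirectStandardOutput = true,  // lets us read the dumped text
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns whatever it wrote to standard output.
    public static string ConvertToText(string lynxPath, string htmlFileOrUrl)
    {
        using (Process process = Process.Start(CreateStartInfo(lynxPath, htmlFileOrUrl)))
        {
            // Read the output before waiting for exit, so a full output
            // buffer cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Usage would look something like `string text = LynxConverter.ConvertToText(@"C:\tools\lynx\lynx.exe", @"C:\temp\page.html");`, where both paths are placeholders.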

+2  A: 

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.

FlySwat
Ha! I forgot about Lynx! Apparently, it's still maintained as a Windows version (http://fredlwm.iblogger.org/cygwin/lynx/).
Matt
Nope, it actually makes it easier!! (see question edit). Thanks again!
Matt
There's one called "Links" spelled properly too... http://links.sourceforge.net/
Mark
+1  A: 

The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.
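A rough sketch of that idea, assuming simple and reasonably well-formed markup. The class name `SimpleHtmlToText` is hypothetical, and tables are not handled here; they would still be flattened.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        // <br> tags and closing </p> tags become line breaks.
        string text = Regex.Replace(html, @"<br\s*/?>|</p\s*>", "\n", RegexOptions.IgnoreCase);

        // Each opening <li> starts a new line prefixed with a dash.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);

        // Every remaining tag is simply removed.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Turn entities such as &amp; back into characters, then trim the ends.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Regular expressions are only a stopgap here: script and style contents, comments, and attribute values containing ">" would all need extra handling.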

EricSchaefer
Good thinking, it's actually pretty easy to do a rough version.
DrG
Well, it depends on the HTML. I wrote a quick and dirty version of this approach in PHP for a CMS that sent a weekly digest of posts by plain-text email. In that case the editor for the posts only allowed certain HTML elements. It would be much harder if full HTML transitional were allowed.
EricSchaefer
+3  A: 

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.

DrG
+1  A: 

I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/

jw
This is closer to what I'm looking for, but it still "flattened" the HTML tables. :(
Matt
+4  A: 

I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.

http://www.codeplex.com/htmlagilitypack

Some sample on SO..

http://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables
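As a hedged illustration of what that path might look like for tables, here is a sketch that uses the HTML Agility Pack to read table cells and pad them into aligned text columns. It assumes the HtmlAgilityPack package is referenced; the class name `TableToText` is hypothetical, and nested tables, colspan and rowspan are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack; // external library, not part of the .NET base class library

public static class TableToText
{
    public static string RenderTables(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var output = new StringBuilder();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null) return string.Empty;

        foreach (HtmlNode table in tables)
        {
            // Collect the trimmed, entity-decoded text of every cell, row by row.
            var rows = new List<List<string>>();
            var rowNodes = table.SelectNodes(".//tr");
            if (rowNodes == null) continue;
            foreach (HtmlNode tr in rowNodes)
            {
                var cells = tr.SelectNodes("./th|./td");
                if (cells == null) continue;
                rows.Add(cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim()).ToList());
            }
            if (rows.Count == 0) continue;

            // Each column is as wide as its longest cell.
            var widths = new int[rows.Max(r => r.Count)];
            foreach (var row in rows)
                for (int i = 0; i < row.Count; i++)
                    widths[i] = Math.Max(widths[i], row[i].Length);

            // Emit each row with cells padded to the column width.
            foreach (var row in rows)
            {
                var padded = row.Select((cell, i) => cell.PadRight(widths[i]));
                output.AppendLine(string.Join("  ", padded).TrimEnd());
            }
            output.AppendLine();
        }
        return output.ToString();
    }
}
```

This only covers the table part of the layout problem; headings, lists and paragraphs would need similar per-element rules.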

madcolor
Thanks madcolor, I'll give this a try.
Matt
Good Luck.. I don't pretend that it's going to be easy.. but I think it's the correct path to go down.
madcolor
+1  A: 

Another post suggests the HTML agility pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

crb
A: 

Not a C# solution, but you might want to take a look at this: an HTML to Markdown converter written in PHP. This, of course, doesn't convert all tags.

çağdaş
+1  A: 

You could use this:

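    // Requires: using System.Text.RegularExpressions;
    public static string StripHTML(string HTMLText)
    {
        // Removes anything that looks like a tag; no layout is preserved.
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }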
Richard
A: 

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

Brian Genisio
A: 

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product.

Maximus
Max, be clear that this is your product that you are recommending. All of your answers IIRC are you suggesting this product. The SO community is pretty protective and sensitive to spamming/astroturfing. If you are not clear, and if all you do here is suggest people buy your software, you are going to end up doing yourself more harm than good.
Will
Hi Will! Yes, it's my product. You are right, and I'm sorry that this post looks like advertising. I'll change it right now to remove the advertising.
Maximus
A: 

I have recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.

ProNotion
A: 

Nice idea with the Lynx! Have you tested out that solution with a high user volume (load test)?

If you already have a C# HTML editor, like CuteEditor, it may have C# classes you can use to do the conversion you are trying to accomplish.

Michael Ritchson
I have not tested this solution under load. Since it calls an out-of-process exe, I know it won't be as fast as if I ported Lynx to C#! :) It works well for my situation - a desktop application that calls the conversion routine interactively.
Matt
A: 

I think the easiest way is to make a 'string' extension method (based on what user Richard has suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any 'string' variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so they are displayed correctly on a Crystal report, and it works perfectly!

M.T.
A: 

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;