ansaurus

Question

Answer 1

+2 A:

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.

FlySwat 2009-04-08 20:26:23

Ha! I forgot about Lynx! Apparently, it's still maintained as a Windows version (http://fredlwm.iblogger.org/cygwin/lynx/).

Matt 2009-04-09 16:28:39

Nope, it actually makes it easier!! (see question edit). Thanks again!

Matt 2009-04-09 16:39:45

There's one called "Links" spelled properly too... http://links.sourceforge.net/

Mark 2010-09-22 07:06:20

Answer 2

+1 A:

The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.
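A minimal sketch of this tag-replacement idea in C#, not the answerer's original code. The class and method names (`HtmlLayoutText`, `ToPlainText`) are made up for illustration, and the choice of which tags map to which layout characters is an assumption. It only aims at simple, restricted HTML of the kind an editor with a tag whitelist would produce.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlLayoutText
{
    // Hypothetical helper: swaps a few tags for text layout, then strips the rest.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        const RegexOptions opts = RegexOptions.IgnoreCase;

        // Whitespace in the HTML source carries no layout, so collapse it first.
        string text = Regex.Replace(html, @"\s+", " ");
        text = Regex.Replace(text, @"<br\s*/?>", "\n", opts);                     // <br> -> line break
        text = Regex.Replace(text, @"</(p|div|h[1-6]|ul|ol)\s*>", "\n\n", opts);  // block end -> blank line
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", opts);                 // list item -> dash
        text = Regex.Replace(text, @"</t[dh]\s*>", "\t", opts);                   // table cell end -> tab
        text = Regex.Replace(text, @"</tr\s*>", "\n", opts);                      // table row end -> new line
        text = Regex.Replace(text, @"<[^>]+>", "");                               // strip all remaining tags
        text = WebUtility.HtmlDecode(text);                                       // &amp; -> &, etc.

        // Tidy up: drop padding around line breaks and cap runs of blank lines.
        text = Regex.Replace(text, @"[ \t]*\n[ ]*", "\n");
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

Table cells end up tab-separated with one row per line, which keeps the row structure but does not align columns.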

EricSchaefer 2009-04-08 20:27:28

Good thinking, it's actually pretty easy to do a rough version.

DrG 2009-04-08 20:29:00

Well, it depends on the HTML. I wrote a quick and dirty version of this approach in PHP for a CMS that sent a weekly digest of the posts by plain-text email. In that case the editor for the posts only allowed certain HTML elements. It would be much harder if full HTML Transitional were allowed.

EricSchaefer 2009-04-08 20:33:51

Answer 3

+3 A:

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.

DrG 2009-04-08 20:27:48

Answer 4

+1 A:

I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/

jw 2009-04-08 20:28:40

This is closer to what I'm looking for, but it still "flattened" the HTML tables. :(

Matt 2009-04-08 20:50:59

Answer 5

+4 A:

I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again..

http://www.codeplex.com/htmlagilitypack

There are some samples on SO:

http://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables
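As a rough illustration of how the HTML Agility Pack can keep table rows apart instead of flattening them, here is a small sketch. It assumes the third-party HtmlAgilityPack package is referenced in the project; the class and method names (`TableText`, `TablesToText`) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party package, not part of .NET itself

public static class TableText
{
    // Hypothetical helper: returns one tab-separated line per table row.
    public static string TablesToText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed "real world" HTML

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return string.Empty;

        var lines = new List<string>();
        foreach (var row in rows)
        {
            // Direct header or data cells of this row only.
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;

            // InnerText drops nested tags; DeEntitize turns &amp; into & and so on.
            lines.Add(string.Join("\t",
                cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())));
        }
        return string.Join("\n", lines);
    }
}
```

Nested tables would show up twice with this XPath (once inside the outer cell text and once as their own rows), so a real implementation would need to decide how to handle them.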

madcolor 2009-04-08 20:33:14

Thanks Madcolor, I'll give this a try.

Matt 2009-04-08 20:51:47

Good Luck.. I don't pretend that it's going to be easy.. but I think it's the correct path to go down.

madcolor 2009-04-08 21:02:17

Answer 6

+1 A:

Another post suggests the HTML agility pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

crb 2009-04-08 20:36:18

Answer 7

A:

Not a C# solution, but you might want to take a look at this: an HTML-to-Markdown converter written in PHP. It doesn't convert all tags, of course.

çağdaş 2009-04-08 22:14:10

Answer 8

+1 A:

You could use this:

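    // Requires: using System.Text.RegularExpressions;
    public static string StripHTML(string HTMLText)
    {
        // Removes anything that looks like a tag; entities such as &amp; are left as-is.
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }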
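A plain tag-stripping regex like this leaves a few things behind: the contents of script and style elements, HTML comments, and undecoded entities. The sketch below shows one way those could be handled; the names `HtmlCleaner` and `StripHtmlAndDecode` are hypothetical, and regex-based stripping remains an approximation rather than a real HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Hypothetical helper: strips tags, non-visible blocks, and decodes entities.
    public static string StripHtmlAndDecode(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Singleline lets '.' match newlines so multi-line blocks are caught.
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Remove <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "", opts);
        // Remove HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", "", opts);
        // Remove every remaining tag, as in the answer above.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // Turn entities such as &amp; and &lt; back into characters.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Decoding happens last on purpose: decoding `&lt;b&gt;` before stripping would turn escaped text into something that looks like a tag and then delete it.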

Richard 2009-04-08 22:20:01

Answer 9

A:

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

Brian Genisio 2009-04-08 22:20:25

Answer 10

A:

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product.

Maximus 2009-09-29 11:23:41

Max, be clear that this is your product that you are recommending. All of your answers IIRC are you suggesting this product. The SO community is pretty protective and sensitive to spamming/astroturfing. If you are not clear, and if all you do here is suggest people buy your software, you are going to end up doing yourself more harm than good.

Will 2010-08-12 22:36:41

Hi Will! Yes, it's my product. You are right, and I'm sorry that this post looks like advertising. I'll change it right now to remove the advertising.

Maximus 2010-08-16 05:37:32

Answer 11

A:

I recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.
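To give a feel for the XSLT approach, here is a minimal sketch using `XslCompiledTransform` from the .NET base class library. The inline stylesheet is a deliberately tiny stand-in invented for this example, not the Markdown XSLT file from the blog post, and the names `XsltText` and `Transform` are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltText
{
    // Hypothetical stylesheet: headings and paragraphs end with a blank line,
    // list items become "- item" lines, everything else falls through as text.
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='p|h1|h2|h3'><xsl:apply-templates/><xsl:text>&#10;&#10;</xsl:text></xsl:template>
  <xsl:template match='li'><xsl:text>- </xsl:text><xsl:apply-templates/><xsl:text>&#10;</xsl:text></xsl:template>
  <xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>
</xsl:stylesheet>";

    // The input must be well-formed XML; ordinary tag-soup HTML will throw.
    public static string Transform(string xhtml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (var input = XmlReader.Create(new StringReader(xhtml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output); // no XSLT arguments needed here
            return output.ToString();
        }
    }
}
```

Two caveats with this sketch: the match patterns assume elements without a namespace, so real XHTML with the XHTML namespace declared would need prefixed patterns, and input that carries a DOCTYPE would need `XmlReaderSettings` that allow DTD processing.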

ProNotion 2009-11-05 21:01:25

Answer 12

A:

Nice idea with the Lynx! Have you tested out that solution with a high user volume (load test)?

If you already have a C# HTML editor, like CuteEditor, it may have C# classes you can use to do the conversion you are trying to accomplish.

Michael Ritchson 2010-09-03 15:34:24

I have not tested this solution under load. Since it calls an out-of-process exe, I know it won't be as fast as if I had ported Lynx to C#! :) It works well for my situation: a desktop application that calls the conversion routine interactively.
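For readers wondering what calling Lynx out of process might look like, here is a rough sketch. It is not the code from the question edit; it assumes a `lynx` executable is installed and reachable at the given path, and it passes the HTML through a temporary file. `LynxConverter`, `BuildArguments`, and `HtmlToText` are made-up names.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class LynxConverter
{
    // -dump prints the rendered page to stdout; -nolist omits the link list.
    public static string BuildArguments(string htmlFile)
    {
        return "-dump -nolist \"" + htmlFile + "\"";
    }

    // lynxPath is an assumption: adjust it to wherever lynx lives on your machine.
    public static string HtmlToText(string html, string lynxPath = "lynx")
    {
        // The .html extension lets Lynx treat the file as HTML rather than plain text.
        string tempFile = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
        File.WriteAllText(tempFile, html);
        try
        {
            var psi = new ProcessStartInfo
            {
                FileName = lynxPath,
                Arguments = BuildArguments(tempFile),
                RedirectStandardOutput = true,
                UseShellExecute = false, // required for stream redirection
                CreateNoWindow = true
            };

            using (var process = Process.Start(psi))
            {
                // Read before waiting so a full output pipe cannot block the child.
                string text = process.StandardOutput.ReadToEnd();
                process.WaitForExit();
                return text;
            }
        }
        finally
        {
            File.Delete(tempFile);
        }
    }
}
```

Starting a process per conversion is the main cost here, which fits the interactive desktop use described above better than a high-volume server.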

Matt 2010-10-13 15:31:38

Answer 13

A:

I think the easiest way is to make a 'string' extension method (based on what user Richard has suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any 'string' variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so they display correctly on a Crystal Report, and it works perfectly!

M.T. 2010-09-15 20:12:17

Answer 14

A:

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

2010-09-22 06:52:37

How can I Convert HTML to Text in C#?
