I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping , but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just download the HTML Agility Pack from CodePlex, and ran the Html2Txt project. What a disappointment (at least the module that does html to text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like the Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction i wanted to go. I can use the System.Diagnostics.Process
class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false
and ProcessStartInfo.RedirectStandardOutput = true
. I'll wrap all this in a C# class. This code will be called only occassionly, so i'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!