views:

64

answers:

1

I need to extract text from an HTML file using C#. I am trying to use HTMLAgilityPack but I am seeing some parse errors (tags not closed). I am using these two options:

        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.OptionAutoCloseOnEnd = true;

Is there any "Fix all" type option. I don't care about the errors, I just want the content or close.

+2  A: 

Maybe this is workaround but once I had to extract text from HTML I used regex:

result = Regex.Replace(result, @"<(.|\n)*?>", String.Empty);
result = Regex.Replace(result, @"^\n*", String.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);
result = Regex.Replace(result, @"\n*$", String.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);
result = result.Replace("\n", " ");
Ichibann
Thanks! I was looking for a more HTMLAgilityPack solution...
tvr