I am trying to compare two strings, but I just realized that one of them already contains some HTML formatting.

How can I get these two strings to match when doing string1 == string2? (Note: I don't know up front what the HTML formatting is going to be.)

string1 = "This is a test";
string2 = "<font color=\"black\" size=\"1\">This is a test</font>";
+6  A: 

Load the HTML into the Html Agility Pack and extract only the text.

string html = "<html><body><div>test</div></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
string text = document.DocumentNode.InnerText;

This will not remove the content of <script> nodes, but you can easily remove the script nodes first.
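
As a rough sketch of that idea applied to the strings from the question: the helper below strips script nodes before reading InnerText, then compares the result with the plain string. The `HtmlText.Extract` name is made up for illustration, and removing style nodes, decoding entities with `HtmlEntity.DeEntitize`, and trimming whitespace are extra assumptions on my part rather than something the question asked for.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class HtmlText
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard first.
        var unwanted = document.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list so removal does not disturb the enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // Assumption: entities such as &amp; should compare as plain characters.
        return HtmlEntity.DeEntitize(document.DocumentNode.InnerText).Trim();
    }
}

static class Program
{
    static void Main()
    {
        string string1 = "This is a test";
        string string2 = "<font color=\"black\" size=\"1\">This is a test</font>";

        // Compare the plain string against the text extracted from the markup.
        Console.WriteLine(string1 == HtmlText.Extract(string2));
    }
}
```

If whitespace or entity differences matter for your comparison, drop the `Trim()` and `DeEntitize` calls and compare the raw InnerText instead.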

Mikael Svenson
Obligatory link - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Oded
@Mikael Svenson - how do I extract only the text using the HTML Agility Pack?
ooo
@ooo: I added a sample on how to do it.
Mikael Svenson
HtmlAgilityPack is pretty awesome for parsing HTML.
alimbada
@Mikael Svenson - can I simply pass <font color=\"black\" size=\"1\">This is a test</font> into the code above? It doesn't seem to work; hw.Load() seems to expect a URL as the parameter.
ooo
@ooo: Use HtmlNode.CreateNode(myHtmlString) to create a new HtmlNode, and then get the InnerText of that instance.
alimbada
@Mikael Svenson - thanks, this worked, but I found one issue. I posted it as a question here: http://stackoverflow.com/questions/3425554/does-the-html-agility-pack-work-on-internal-text
ooo
Fixed my code sample. Sorry about that.
Mikael Svenson
Are you sure that `InnerText` removes *all* HTML tags, rather than just the outermost pair?
Timwi
Yes, just as it does on a DOM object in a web page (except that innerText is not cross-platform JavaScript). InnerText returns all the text inside the container you start from, at any nesting depth: <body><div>lala</div><div>lala</div></body> gives the same result as <body><div>lala<div>la<span>la</span></div></div></body>
Mikael Svenson
A: 

Check out System.Web.HttpUtility.HtmlDecode.

ppshein
This will not let you extract only the text, or remove your tags.
Mikael Svenson
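
To illustrate the point in that comment, here is a small sketch of what HtmlDecode does and does not do. The sample strings are made up, apart from the one taken from the question, and on .NET Framework this assumes the project references System.Web.

```csharp
using System;
using System.Web;

static class DecodeDemo
{
    static void Main()
    {
        // HtmlDecode turns character entities back into characters...
        string encoded = "&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;";
        Console.WriteLine(HttpUtility.HtmlDecode(encoded)); // <b>Tom & Jerry</b>

        // ...but it leaves real tags in place, so the markup is still there.
        string tagged = "<font color=\"black\" size=\"1\">This is a test</font>";
        Console.WriteLine(HttpUtility.HtmlDecode(tagged) == tagged); // True
    }
}
```

So decoding is only useful here as a follow-up step after the tags have been stripped by a parser.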