My application collects HTML content from internal users and uses it to dynamically build articles on the company website.

I want to implement a feature whereby users can surround a word or phrase in the HTML content with a special tag, <search>....</search>. When the content is saved to the database, the application should convert <search>WORD/PHRASE</search> to, say, www.google.com/?q=WORD/PHRASE, after URL-encoding the word or phrase.

I think regular expressions can be used to achieve this functionality but need some guidance on how to go about it since there could be more than one <search>....</search> tag in the HTML content.

Any help with examples is appreciated.

+1  A: 

You should consider using an HTML DOM to parse the contents rather than regular expressions. Regexes meant to parse HTML are notorious for being both complicated and prone to unexpected bugs.

configurator
Can you give an example of how to use an HTML DOM for a custom tag like the one I want to use?
See DanHerbert's more complete answer
configurator
A: 

Should be pretty easy with non-greedy (lazy) matching, assuming search tags can't be nested.

Replacing on

<search>(.*?)</search> is going to be key.

Stefan Kendall
+1  A: 

You might try

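Regex.Replace(strMyHtmlInputString, "<search>(.+?)</search>", "www.google.com/?q=$1")

The question mark after the + makes the quantifier lazy: the group matches as few characters as possible, so each <search> is paired with the nearest </search> instead of the last one in the input. Note that .NET replacement strings refer to the first group as $1, not \1.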

Mike
+1  A: 

Something like this should work:

string data = @"some text <search>search term 1</search> some more text <search>another search term</search>";
Console.WriteLine(Regex.Replace(data, @"(?:<search>)(.*?)(?:</search>)", @"<a href=""http://www.google.com/?q=$1"">$1</a>"));
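
The question also asks for the word or phrase to be encoded, which a plain $1 substitution does not do. A minimal sketch of one way to add that, using a match evaluator so each captured term can be processed before it goes into the URL. The class and method names (SearchTagConverter, ToLinks) are made up for illustration, and the sketch assumes search tags are never nested and may contain HTML entities such as &amp;.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SearchTagConverter
{
    // Lazy (.*?) keeps each match inside one <search>...</search> pair.
    // Singleline lets a phrase span line breaks.
    private static readonly Regex SearchTag = new Regex(
        @"<search>(.*?)</search>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string ToLinks(string html)
    {
        return SearchTag.Replace(html, match =>
        {
            string term = match.Groups[1].Value;
            // Decode HTML entities first, then percent-encode for the query string.
            string query = Uri.EscapeDataString(WebUtility.HtmlDecode(term));
            return "<a href=\"http://www.google.com/?q=" + query + "\">" + term + "</a>";
        });
    }
}
```

Calling SearchTagConverter.ToLinks(data) on the sample string above should turn both search tags into links, with the spaces in each term percent-encoded in the href while the visible link text is left unchanged.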
Fredrik Mörk
This works perfectly. Can you do it in reverse, converting the <a> link back to a search tag? Let's say we mark the <a> tag with a special attribute (say, class="searchterm") for the purpose of matching.
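
A rough sketch of the reverse direction asked about here, assuming the generated links really do carry class="searchterm" (double-quoted, exactly that value), that links are not nested, and that no attribute value contains a > character. The names SearchLinkReverter and ToSearchTags are invented for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SearchLinkReverter
{
    // Matches <a ... class="searchterm" ...>text</a> and captures the link text.
    // [^>]* stops at the end of the opening tag, so other links are skipped.
    private static readonly Regex SearchLink = new Regex(
        @"<a\b[^>]*\bclass=""searchterm""[^>]*>(.*?)</a>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string ToSearchTags(string html)
    {
        return SearchLink.Replace(html, "<search>$1</search>");
    }
}
```

Links without the marker class are left alone, so ordinary anchors in the article survive the round trip. If the markup can vary more than assumed here, a DOM-based approach is likely the safer choice.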
+1  A: 

Regular expressions are bad at handling XML/HTML data. You're better off using a real HTML or XML reading API. Regular expressions run into problems when you're dealing with HTML that has nested tags within it, for example.

If you're getting tag-soup HTML, which you most likely are, you won't be able to use .NET's native XmlDocument class without a lot of stress. You should look into the HtmlAgilityPack, which has an API much like XmlDocument's, but it includes some HTML-specific things such as cleaning up HTML to be well-formed.

This example uses the XmlDocument class, but using the HtmlAgilityPack's HtmlDocument should be very similar (only using an HtmlDocument instead of an XmlDocument). This should replace the <search /> tag with the link to Google.

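// Requires: using System; using System.Xml;
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
XmlNode searchTag = doc.SelectSingleNode("//search");
XmlElement linkTag = doc.CreateElement("a");
linkTag.InnerXml = searchTag.InnerXml;
// SetAttribute creates the href attribute; Attributes["href"] is null on a new element.
linkTag.SetAttribute("href", "http://google.com/?q=" + Uri.EscapeDataString(linkTag.InnerText));
// ReplaceChild takes the new node first, then the node being replaced.
searchTag.ParentNode.ReplaceChild(linkTag, searchTag);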

Disclaimer: I have not tested this example code above, but it should work.
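
Since the question mentions that the content may hold more than one <search> tag, here is a hedged sketch of the same XmlDocument idea applied to every match rather than a single node. It assumes the input is well-formed XML (LoadXml throws otherwise) and that search tags are not nested; SearchTagDom and ToLinks are illustrative names only.

```csharp
using System;
using System.Linq;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class SearchTagDom
{
    public static string ToLinks(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        // Copy the matches to a list first, because the document is modified in the loop.
        foreach (XmlNode searchTag in doc.SelectNodes("//search").Cast<XmlNode>().ToList())
        {
            XmlElement linkTag = doc.CreateElement("a");
            linkTag.InnerXml = searchTag.InnerXml;
            // Percent-encode the text content for use in the query string.
            linkTag.SetAttribute("href",
                "http://google.com/?q=" + Uri.EscapeDataString(linkTag.InnerText));
            // New node first, then the node being replaced.
            searchTag.ParentNode.ReplaceChild(linkTag, searchTag);
        }
        return doc.OuterXml;
    }
}
```

Any markup inside a search tag (for example, bold text) is carried over into the link, while only its text content ends up in the URL.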

Dan Herbert