I have a string that should contain a list of items in the form <link rel="{0}" type="{1}" href="{2}">, where {0}, {1}, and {2} are strings, and I want to extract them.

I want to do this as part of an HTML parsing problem, and I have heard that parsing HTML with regular expressions is a bad idea. (Like here)

I am not even sure how to do this with regular expressions.

This is as far as I got:

string format = "<link rel=\".*\" type=\".*\" href=\".*\">";
Regex reg = new Regex(format);
MatchCollection matches = reg.Matches(input, 0);
foreach (Match match in matches)
{
    string rel = string.Empty;
    string type = string.Empty;
    string href = string.Empty;
    // not sure what to do here to get these values for each from the match
}

That was before my research suggested that I might be completely on the wrong track using regular expressions.

How would you do this either with the method I chose or with an HTML parser?

+1  A: 

Parse your HTML using the HTML Agility Pack library, which can be found here.

Rony
Thanks for the link.
James W
A: 

You'd be better off using a real HTML parser like the Html Agility Pack. You can get it here.

A main reason for not using regular expressions to parse HTML is that the markup might not be well-formed (which is almost always the case), and that can break a regular-expression-based parser.
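To make that fragility concrete, here is a minimal sketch of the regex approach from the question, using only System.Text.RegularExpressions. The class name RegexLinkDemo, the Extract helper, and the sample markup are hypothetical and chosen for illustration. Named groups pull out the three values, and [^"]* replaces the greedy .* so a capture cannot run past its closing quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexLinkDemo
{
    // Named groups capture each attribute value. The pattern still assumes
    // the exact attribute order rel, type, href with single spaces between them.
    static readonly Regex LinkPattern = new Regex(
        "<link rel=\"(?<rel>[^\"]*)\" type=\"(?<type>[^\"]*)\" href=\"(?<href>[^\"]*)\">");

    // Returns one "rel | type | href" line per tag that fits the pattern.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match match in LinkPattern.Matches(input))
        {
            results.Add(string.Format("{0} | {1} | {2}",
                match.Groups["rel"].Value,
                match.Groups["type"].Value,
                match.Groups["href"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample: the second tag lists href before type.
        string input =
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"a.css\">" +
            "<link rel=\"alternate\" href=\"feed.xml\" type=\"application/rss+xml\">";

        foreach (string line in Extract(input))
        {
            Console.WriteLine(line);
        }
        // Prints only: stylesheet | text/css | a.css
    }
}
```

The second tag is perfectly valid HTML, yet it is silently skipped because its attributes appear in a different order. Extra whitespace, single quotes, or a self-closing slash would cause the same kind of miss.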

You would then use XPath to get the nodes you need and load them into variables.

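// Requires the Html Agility Pack: using HtmlAgilityPack;
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(pageMarkup);
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//link");
string rel = string.Empty;

if (nodes != null && nodes[0].Attributes["rel"] != null)
{
    // Attributes["rel"] is an HtmlAttribute; its Value property holds the string.
    rel = nodes[0].Attributes["rel"].Value;
}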
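The snippet above reads a single attribute from the first node. A fuller sketch that loops over every link element and collects all three values might look like the following. It assumes the Html Agility Pack is referenced in the project. The class name LinkExtractor, the ExtractLinks helper, and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET base library

public static class LinkExtractor
{
    // Returns one { rel, type, href } array per link element found in the markup.
    public static List<string[]> ExtractLinks(string pageMarkup)
    {
        var results = new List<string[]>();

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(pageMarkup);

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//link");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // GetAttributeValue falls back to the given default when the attribute is absent.
            results.Add(new[]
            {
                node.GetAttributeValue("rel", string.Empty),
                node.GetAttributeValue("type", string.Empty),
                node.GetAttributeValue("href", string.Empty)
            });
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample: the second tag reorders its attributes and omits type.
        string pageMarkup =
            "<html><head>" +
            "<link rel=\"stylesheet\" type=\"text/css\" href=\"a.css\">" +
            "<link href=\"feed.xml\" rel=\"alternate\">" +
            "</head><body></body></html>";

        foreach (string[] link in ExtractLinks(pageMarkup))
        {
            Console.WriteLine("rel={0} type={1} href={2}", link[0], link[1], link[2]);
        }
    }
}
```

Because the parser looks attributes up by name, attribute order and missing attributes do not matter here; an absent attribute simply comes back as an empty string.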
Dan Herbert
Thanks. I am giving you the check mark because your answer had helpful code and you explained why to use the parser instead of a regex. Thanks to Rony too for the link to the HTML Agility Pack; I just downloaded it.
James W