I'm trying to parse some HTML and return an array with the values inside each element.

For example, if I pass the markup below into a function:

var element = "td";
var html = "<tr><td>1</td><td>2</td></tr>";
return Regex.Split(html, string.Format("<{0}*.>(.*?)</{0}>", element));

I'm expecting to get back an array containing { 1, 2 }.

What does my regex need to look like? Currently my array comes back with far too many elements, and my regex skills are lacking.

+6  A: 

Do not parse HTML using regular expressions.

Instead, you should use the HTML Agility Pack.

For example:

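// Needs: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(str); // str holds the HTML markup to parse

IEnumerable<string> cells = doc.DocumentNode.Descendants("td").Select(td => td.InnerText);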
SLaks
+1  A: 

You really should not use regex to parse HTML. HTML is not a regular language, so a regex cannot reliably match elements that nest inside one another. You should use a parser.
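
To make the nesting problem concrete, here is a minimal sketch. The class and method names (`NestedTagDemo`, `NaiveExtract`) are hypothetical, and the pattern is just one plausible attempt, not a recommended one. It looks fine on flat markup but pairs the outer opening tag with the inner closing tag as soon as elements nest.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Hypothetical helper: naive regex extraction of whatever sits between <tag> and </tag>.
    public static string[] NaiveExtract(string html, string tag)
    {
        string pattern = string.Format("<{0}[^>]*>(.*?)</{0}>", Regex.Escape(tag));
        MatchCollection matches = Regex.Matches(html, pattern, RegexOptions.Singleline);

        var results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Groups[1].Value; // group 1 is the captured content
        }
        return results;
    }

    public static void Main()
    {
        // Flat markup: the pattern appears to work.
        Console.WriteLine(string.Join(" | ", NaiveExtract("<tr><td>1</td><td>2</td></tr>", "td")));
        // prints "1 | 2"

        // Nested markup: the lazy match stops at the first closing tag it finds.
        Console.WriteLine(string.Join(" | ", NaiveExtract("<div>outer <div>inner</div> tail</div>", "div")));
        // prints "outer <div>inner"
    }
}
```

The second result is neither the outer element's content nor the inner one's, which is the kind of silent mismatch a real parser avoids.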

C# has HTML parsers for this.
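
As a rough sketch of the parser approach that needs no third-party package: if the fragment happens to be well-formed XML (as the question's example is), `System.Xml.Linq` can read it. That is a strong assumption, since real-world HTML frequently is not well-formed, and a dedicated HTML parser is the safer choice there. The class and method names (`HtmlValues`, `GetValues`) are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HtmlValues
{
    // Hypothetical helper: returns the text inside every <elementName> element.
    // Assumes the markup is well-formed XML; XElement.Parse throws XmlException otherwise.
    public static string[] GetValues(string html, string elementName)
    {
        XElement root = XElement.Parse(html);
        return root.Descendants(elementName)
                   .Select(e => e.Value)
                   .ToArray();
    }

    public static void Main()
    {
        string[] values = GetValues("<tr><td>1</td><td>2</td></tr>", "td");
        Console.WriteLine(string.Join(", ", values)); // prints "1, 2"
    }
}
```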

JoshD