I'm looking for a simple way in .NET to parse an HTML file and get back all the values inside <u></u> tags.

Ex: <U>105F</U>

There may be many of these in the file, mixed with other tags, but all I need is to loop through and get back a list of all the values so they can then be processed.

I'm looking for a small, lightweight way to handle this.

A: 

If the HTML document is well formed, XPath would be my first choice.

Requested code example (untested, though):

using System;
using System.Xml.XPath;

// XPath is case-sensitive: "//U" matches <U> but not <u>
var doc                    = new XPathDocument (@"path\to\file.html");
XPathNavigator navigator   = doc.CreateNavigator ();
XPathNodeIterator iterator = navigator.Select ("//U");
while (iterator.MoveNext ())
    Console.WriteLine ("U: {0}", iterator.Current.Value);
Björn
It is well formed, with all matching tags and very basic HTML. Do you have a sample of using XPath for this?
schooner
+3  A: 

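Definitely regular expressions:

Imports System.Text.RegularExpressions

' Lazy capture of the text between the tags
Dim myPattern As String = "<U>(.*?)</U>"

For Each thisMatch As Match In Regex.Matches(myPage1HTML, myPattern, RegexOptions.IgnoreCase)
    ' Groups(1) is the captured value; thisMatch.ToString would include the tags
    Response.Write(thisMatch.Groups(1).Value)
Next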
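
For anyone who wants the same idea in C# with the values collected into a list, here is a minimal sketch. The class and method names (UTagRegex, ExtractValues) are made up for illustration, and it assumes the markup is as clean as the question describes, with no attributes on the tag and no nested <u> elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class UTagRegex
{
    // Returns the text between each <u>...</u> pair, without the tags.
    public static List<string> ExtractValues(string html)
    {
        var values = new List<string>();
        // IgnoreCase matches both <u> and <U>; Singleline lets '.' span line breaks.
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match match in Regex.Matches(html, "<u>(.*?)</u>", options))
        {
            // Group 1 is the lazy capture between the opening and closing tag.
            values.Add(match.Groups[1].Value);
        }
        return values;
    }
}
```

Calling UTagRegex.ExtractValues(File.ReadAllText(path)) would then give a list to loop over. The lazy quantifier (.*?) is what keeps each match from swallowing everything up to the last closing tag in the file.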
NickAtuShip
Here's a good resource: http://www.regular-expressions.info/dotnet.html
NickAtuShip
-1 for suggesting parsing HTML with regular expressions. See http://www.codinghorror.com/blog/archives/000253.html
TrueWill
Regex worked fine in my case, as the HTML is very clean and specific to the content each time.
schooner
A: 
XmlNodeList list = doc.SelectNodes("//u");

This gets you the list of U nodes, where doc is an already loaded XmlDocument.
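
A slightly fuller sketch of that one-liner, in case it helps. It assumes the input really is well-formed XML (LoadXml throws an XmlException otherwise, and HTML-only entities such as &nbsp; will also fail without a DTD). The names UTagXPath and ExtractValues are hypothetical. Because XPath names are case-sensitive, the expression below matches both spellings of the tag.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class UTagXPath
{
    // Collects the text content of every <u> or <U> element.
    public static List<string> ExtractValues(string wellFormedMarkup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup); // throws XmlException if not well formed

        var values = new List<string>();
        // "//u" alone would miss <U>105F</U>, so take the union of both names.
        foreach (XmlNode node in doc.SelectNodes("//u | //U"))
        {
            values.Add(node.InnerText);
        }
        return values;
    }
}
```

To read from a file instead of a string, doc.Load(path) can replace the LoadXml call.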

jitter
A: 

Sample for using XPath with XmlDocument:

XmlDocument doc = new XmlDocument();
doc.Load("file.html");

XmlNodeList nodeList = doc.DocumentElement.SelectNodes("//u");
foreach (XmlNode title in nodeList) {
    Console.WriteLine(title.InnerXml);
}

It's taken from here.

Itsik
The problem here is that it's pretty fragile. If there's any HTML that's not well formed, this won't work.
NickAtuShip
True, but he specifically wrote that the XHTML is well formed in his comment below.
Itsik
A: 

Html Agility Pack.

I strongly advise against using regular expressions for parsing HTML. They're a great tool, but they're not suited to this job. HTML is just too complex. As soon as you hit one of the edge cases (embedded tags, nested tags, etc.) you'll see what I mean.

EDIT: See also Coding Horror: Parsing: Beyond Regex
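
A rough sketch of what this could look like with Html Agility Pack. This assumes the HtmlAgilityPack NuGet package is referenced (it is a third-party library, not part of the .NET base class library). UTagAgility and ExtractValues are hypothetical names. As far as I know the library normalizes element names to lower case, so "//u" should also pick up <U>; treat that as an assumption to verify against your own input.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package

// Hypothetical helper, not part of Html Agility Pack itself.
public static class UTagAgility
{
    // Collects the inner text of every <u> element, tolerating malformed HTML.
    public static List<string> ExtractValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//u");
        if (nodes == null)
        {
            return values; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode node in nodes)
        {
            values.Add(node.InnerText);
        }
        return values;
    }
}
```

The null check matters: unlike XmlDocument.SelectNodes, this call does not hand back an empty list when there are no matches.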

TrueWill
-1 for overcomplicating simple question
NickAtuShip
For my needs, with this specific HTML, regex works perfectly.
schooner