ansaurus

Question

Simple parsing of an HTML file for <U></U> values in .NET?

Answer 1

A:

If the HTML document is well formed, XPath would be my first choice.

Requested code example (untested, though):

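using System;
using System.Xml.XPath;

// Works only if the file parses as well-formed XML (XHTML).
var doc                    = new XPathDocument (@"path\to\file.html");
XPathNavigator navigator   = doc.CreateNavigator ();
// XPath is case-sensitive: "//U" matches <U> but not <u>.
XPathNodeIterator iterator = navigator.Select ("//U");
while (iterator.MoveNext ())
    Console.WriteLine ("U: {0}", iterator.Current.Value);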

Björn 2009-10-08 20:53:26

It is well formed, with all matching tags and very basic HTML. Do you have a sample of using XPath for this?

schooner 2009-10-08 20:55:47

Answer 2

+3 A:

Definitely regular expressions:

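Imports System.Text.RegularExpressions

' Non-greedy match of everything between <U> and </U>.
Dim myPattern As String = "<U>(.*?)</U>"

For Each thisMatch As Match In Regex.Matches(myPage1HTML, myPattern, RegexOptions.IgnoreCase)
    ' Groups(1) is the text between the tags; thisMatch.Value would include the tags.
    Response.Write(thisMatch.Groups(1).Value)
Next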

NickAtuShip 2009-10-08 20:57:05

Here's a good resource: http://www.regular-expressions.info/dotnet.html

NickAtuShip 2009-10-08 20:58:37

-1 for suggesting parsing HTML with regular expressions. See http://www.codinghorror.com/blog/archives/000253.html

TrueWill 2009-10-09 02:54:51

Regex worked fine in my case, as the HTML is very clean and specific to the content each time.

schooner 2009-10-09 09:23:46

Answer 3

A:

XmlNodeList list = doc.SelectNodes("//u");

This gets you the list of U nodes, assuming doc is an XmlDocument that has already loaded the file.

jitter 2009-10-08 20:58:56

Answer 4

A:

A sample using XPath with XmlDocument:

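using System;
using System.Xml;

XmlDocument doc = new XmlDocument();
doc.Load("file.html"); // throws XmlException if the file is not well-formed XML

// XPath is case-sensitive: "//u" matches <u>; use "//U" if the tags are upper-case.
XmlNodeList nodeList = doc.DocumentElement.SelectNodes("//u");
foreach (XmlNode node in nodeList) {
    Console.WriteLine(node.InnerXml);
}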

It's taken from here.

Itsik 2009-10-08 21:01:21

The problem here is that it's pretty fragile. If any of the HTML is not well formed, this won't work.

NickAtuShip 2009-10-08 21:19:48

True, but he specifically wrote in his comment below that the XHTML is well formed.

Itsik 2009-10-08 21:24:17
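
To illustrate the fragility discussed in the comments above, here is a minimal sketch showing that XmlDocument accepts well-formed markup and rejects markup with an unclosed tag. The class name WellFormedCheck, the helper LoadsAsXml, and the sample strings are made up for this illustration; they are not from the answers.

```csharp
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Hypothetical helper: true if the markup loads as XML,
    // false if XmlDocument rejects it with an XmlException.
    public static bool LoadsAsXml(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Every tag is closed, so this parses.
        Console.WriteLine(LoadsAsXml("<html><body><u>one</u></body></html>"));
        // <br> is never closed, so the XML parser throws.
        Console.WriteLine(LoadsAsXml("<html><body><u>one</u><br></body></html>"));
    }
}
```

If the input really is guaranteed to be well formed, as the asker says, the XmlDocument approach is fine; a check like this is only useful when that guarantee is in doubt.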

Answer 5

A:

Html Agility Pack.

I strongly advise against using regular expressions for parsing HTML. They're a great tool, but they're not suited to this job. HTML is just too complex. As soon as you hit one of the edge cases (embedded tags, nested tags, etc.) you'll see what I mean.
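
As a small sketch of the nested-tag edge case, the following runs the same pattern as the regex answer against flat and nested input. The class name NestedTagDemo, the helper FirstCapture, and the sample strings are assumptions made for this illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Hypothetical helper: returns the first captured group for the
    // non-greedy <U>...</U> pattern, or null if nothing matches.
    public static string FirstCapture(string html)
    {
        Match m = Regex.Match(html, "<U>(.*?)</U>", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Flat markup: the capture is the text between the tags ("plain").
        Console.WriteLine(FirstCapture("<u>plain</u>"));
        // Nested markup: the match stops at the first </u>, so the capture
        // is "outer <u>inner" and the tail of the outer element is lost.
        Console.WriteLine(FirstCapture("<u>outer <u>inner</u> tail</u>"));
    }
}
```

With flat, predictable markup the pattern behaves; the trouble only appears once tags nest or span lines.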

EDIT: See also Coding Horror: Parsing: Beyond Regex
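
For reference, a minimal sketch of what this could look like with Html Agility Pack, assuming the HtmlAgilityPack NuGet package is referenced (it is third-party, not part of the .NET base library). The class name AgilityPackSketch, the helper ExtractUnderlined, and the sample markup are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed

public static class AgilityPackSketch
{
    // Hypothetical helper: returns the inner text of every <u> element.
    public static List<string> ExtractUnderlined(string html)
    {
        var doc = new HtmlDocument();
        // Unlike XmlDocument, this parser tolerates markup that is not well formed.
        doc.LoadHtml(html);

        var values = new List<string>();
        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//u");
        if (nodes == null) return values;

        foreach (HtmlNode node in nodes)
            values.Add(node.InnerText);
        return values;
    }

    public static void Main()
    {
        // The unclosed <br> would make XmlDocument throw; here it is accepted.
        string html = "<html><body><u>first</u><br><u>second</u></body></html>";
        foreach (string value in ExtractUnderlined(html))
            Console.WriteLine(value);
    }
}
```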

TrueWill 2009-10-09 02:51:09

-1 for overcomplicating a simple question

NickAtuShip 2009-10-09 03:11:15

For my needs, with this specific HTML, regex works perfectly.

schooner 2009-10-09 09:22:38
