ansaurus

Question

Parsing HTML document: Regular expression or LINQ?

Answer 1

+10 A:

Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard method of manipulating XML and very powerful. The functions to look at are SelectNodes and SelectSingleNode.

Since you are apparently using HTML (not XHTML), you should use HTML Agility Pack. Most of the methods and properties match the related XML classes.

Sample implementation using XPath:

    using System.Collections.Generic;
    using System.IO;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<html>
    <head><title>Blah</title>
    </head>
    <body>
    <br/>
    <div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
    <span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
    <div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
    <div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
    <div>Thanks for visiting!</div>
    </body>
    </html>"));
    HtmlNode root = doc.DocumentNode;
    // XPath 1.0 has no ends-with(), so compare the last four characters of the href.
    // 3 = ".txt".Length - 1.  See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
    HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.) - 3)]]");
    IList<string> fileStrings;
    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (links != null)
    {
        fileStrings = new List<string>(links.Count);
        foreach (HtmlNode link in links)
        {
            fileStrings.Add(link.GetAttributeValue("href", null));
        }
    }
    else
    {
        fileStrings = new List<string>(0);
    }

Matthew Flaschen 2009-05-25 18:00:57
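
For input that really is well-formed XHTML, the same XPath idea works with the built-in System.Xml classes and no extra library. The following is only a minimal sketch under that assumption: the class name XhtmlLinkDemo, the method TxtLinks, and the sample markup are made up for illustration, and it assumes the document declares no default XML namespace (real XHTML usually does, in which case an XmlNamespaceManager would be needed).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinkDemo
{
    // Returns the href of every <a> element whose href ends with ".txt".
    // Assumes well-formed XML with no default namespace (hypothetical input).
    public static List<string> TxtLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException on malformed markup

        // XPath 1.0 has no ends-with(), so compare the last four characters.
        XmlNodeList links = doc.SelectNodes(
            "//a[substring(@href, string-length(@href) - 3) = '.txt']");

        var result = new List<string>();
        foreach (XmlNode link in links)
        {
            result.Add(link.Attributes["href"].Value);
        }
        return result;
    }

    public static void Main()
    {
        string page =
            "<html><body>" +
            "<div><a href=\"http://myServer.com/blah.txt\">one</a></div>" +
            "<div><a href=\"http://myServer.com/page.html\">two</a></div>" +
            "</body></html>";

        // Only the first link ends with ".txt".
        foreach (string href in TxtLinks(page))
        {
            Console.WriteLine(href);
        }
    }
}
```

SelectSingleNode takes the same expression and returns only the first matching node, or null if there is none.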

@Matthew: The HTML Agility Pack gave me what I needed in about 5 minutes of implementation. It came with samples and source. Kudos to Simon Mourier!

p.campbell 2009-05-25 18:24:33

There's also now some support for "LINQ to HTML" in the Agility pack.

Pete Montgomery 2010-07-02 14:29:23
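
A rough sketch of what the LINQ route might look like with the Agility Pack, assuming the HtmlAgilityPack NuGet package is referenced. The class name LinqLinkDemo, the method TxtLinks, and the sample markup are illustrative only; the case-insensitive comparison is a choice made here, not something the thread specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

public static class LinqLinkDemo
{
    // Collects the href of every <a> element whose href ends with ".txt".
    public static List<string> TxtLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")
            // Anchors without an href yield the empty default and are filtered out below.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        string html =
            "<p><a href=\"http://myServer.com/blah.txt\">one</a>" +
            "<a href=\"http://myServer.com/page.html\">two</a>";

        foreach (string href in TxtLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Compared with the XPath version, the "ends with" test is an ordinary string call, so the substring arithmetic disappears.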

Answer 2

+1 A:

I would recommend regex. Why?

Flexible (case-insensitivity, easy to add new file extensions, elements to check, etc.)
Fast to write
Fast to run

Regular expressions will not be hard to read, as long as you can WRITE regexes.

Using this as the regular expression:

    href="([^"]*\.txt)"

Explanation:

It has parentheses around the filename, which will result in a "captured group" which you can access after each match has been found.
It has to escape the "." by using the regex escape character, a backslash.
It has to match any character EXCEPT double-quotes: [^"] until it finds the ".txt"

It translates into an escaped string like this:

    string txtExp = "href=\"([^\"]*\\.txt)\"";

Then you can iterate over your matches:

    MatchCollection txtMatches = Regex.Matches(input, txtExp, RegexOptions.IgnoreCase);
    foreach (Match m in txtMatches) {
      string filename = m.Groups[1].Value; // this is your captured group
    }

Jeff Meatball Yang 2009-05-25 18:25:26

@Jeff: this is an excellent code sample. Thank you for the input!

p.campbell 2009-05-25 18:40:04

That will match .txt anywhere in the href, when the OP explicitly said "ends with". In my opinion, regex is inappropriate here.

Matthew Flaschen 2009-05-25 19:01:45

@Matthew: No, it will only match an HREF ending with (.txt"). I don't think HREF's contain quotes in the middle.

Dmitri Farkov 2009-05-25 19:47:16
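
The disagreement above is easy to check with a small program. This is just a sketch: the class name RegexLinkDemo and the sample hrefs are invented for the test, and the pattern is the one from the answer, written as a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexLinkDemo
{
    // href="([^"]*\.txt)" : the closing quote must come right after ".txt".
    private static readonly Regex TxtHref =
        new Regex(@"href=""([^""]*\.txt)""", RegexOptions.IgnoreCase);

    public static List<string> TxtLinks(string html)
    {
        var result = new List<string>();
        foreach (Match m in TxtHref.Matches(html))
        {
            result.Add(m.Groups[1].Value); // group 1 is the captured URL
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<a href=\"http://myServer.com/blah.txt\">a</a>" +
            "<a href=\"http://myServer.com/notes.txt.html\">b</a>" +
            "<a href=\"http://myServer.com/LOG.TXT\">c</a>";

        // Prints the first and third URLs; the second ends in ".html" and is skipped.
        foreach (string href in TxtLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because [^"]* cannot cross a double quote and the pattern demands a quote immediately after ".txt", a ".txt" in the middle of a double-quoted href does not match.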

Don't try to use regular expressions to parse non-regular languages.

Svante 2009-05-25 20:42:30

I understand the desire to approach this from a DOM/XPath point of view - but my rationale was that a regex implementation assumes very little about the input data. Obviously, if the OP can make assumptions, especially like well-formed documents, a DOM approach is much "cleaner". @Svante: I think regexes are GREAT at finding known patterns out of non-regular data. Think how many times you've grepped for something with a regex. Also, the OP wanted a regex example.

Jeff Meatball Yang 2009-05-26 16:05:12

I misread your regex. However, href's actually can contain "'s, if they are surrounded by '. <a href='foo.txt".html'>foo</a> Granted that is somewhat perverse, but it does validate. Anyway, the important thing is that the OP got something that works well for their current data.

Matthew Flaschen 2009-05-26 21:18:09
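
One consequence of the single-quote case: the original pattern never matches a single-quoted href at all, so a link such as href='foo.txt' would be silently missed. A possible variant that accepts either quote style is sketched below. The class name QuoteAwareLinkDemo and the test markup are hypothetical, and the pattern still assumes quoted attribute values with no whitespace around the equals sign.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteAwareLinkDemo
{
    // Two alternatives: a double-quoted value (group 1) or a single-quoted value (group 2).
    // Each alternative forbids only its own quote character inside the value.
    private static readonly Regex TxtHref = new Regex(
        @"href=(?:""([^""]*\.txt)""|'([^']*\.txt)')",
        RegexOptions.IgnoreCase);

    public static List<string> TxtLinks(string html)
    {
        var result = new List<string>();
        foreach (Match m in TxtHref.Matches(html))
        {
            // Exactly one of the two groups takes part in any match.
            result.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<a href='a.txt'>one</a>" +
            "<a href=\"b.txt\">two</a>" +
            "<a href='foo.txt\".html'>three</a>";

        // Prints a.txt and b.txt; the third href ends in ".html" and is skipped.
        foreach (string href in TxtLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Each special case handled this way makes the pattern longer, which is part of the argument for a real parser.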

Answer 3

A:

As an alternative to Matthew Flaschen's suggestion: the DOM (e.g. if you suffer from an X?L allergy outbreak).

It gets a bad rep sometimes - I guess because implementations are funny sometimes, and the native COM interfaces are a bit unwieldy without some (minor) smart helpers, but I've found it a robust, stable and intuitive / explorable way to parse and manipulate HTML.

peterchen 2009-05-25 18:28:13

You're actually suggesting he use IE's HTML parser from .NET via COM interop?....

Matthew Flaschen 2009-05-25 19:00:08

oh wait, he said "C#".... In that case, noooo.

peterchen 2009-05-25 21:34:33

Answer 4

+1 A:

You are assuming that the HTML document will be well-formed. Most webpages you find on the internet probably wouldn't work with XPath, RegEx or LINQ. You need something very forgiving, which none of these are.

Jon Tackabury 2009-05-25 19:26:45
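
For the strict XML route at least, the point is simple to demonstrate. The sketch below feeds two invented snippets to XmlDocument; the class name StrictParserDemo and the helper ParsesAsXml are made up for the illustration.

```csharp
using System;
using System.Xml;

public static class StrictParserDemo
{
    // Returns true when the markup parses as XML, false when the parser rejects it.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string wellFormed = "<html><body><br/><a href=\"a.txt\">a</a></body></html>";
        // Typical tag soup: unclosed <br> and <a>, unquoted attribute value.
        string tagSoup = "<html><body><br><a href=a.txt>a</body></html>";

        Console.WriteLine(ParsesAsXml(wellFormed)); // True
        Console.WriteLine(ParsesAsXml(tagSoup));    // False
    }
}
```

This is the gap that Answer 1 points to the HTML Agility Pack for when the input is HTML rather than XHTML.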

I know I might be advocating a little too much, but I think regex is flexible, and is the only way to get what you are specifically looking for in a sea of variations.

Jeff Meatball Yang 2009-05-28 05:04:39
