Consider this piece of code:

<tr>
    <td valign=top class="tim_new"><a href="/stocks/company_info/pricechart.php?sc_did=MI42" class="tim_new">3M India</a></td>
    <td class="tim_new" valign=top><a href='/stocks/marketstats/indcomp.php?optex=NSE&indcode=Diversified' class=tim>Diversified</a></td>

I want to write a piece of code using HtmlAgilityPack that extracts the link in the first line.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml("http://theurl.com");
            try
            {
                var links = doc.DocumentNode.SelectNodes("//td[@class=\"tim_new\"]");

            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
                Console.ReadKey();
            }

        }
    }
}

When I add a `foreach (var link in links)` loop inside the try block, a runtime error is thrown.

+1  A: 

The call `doc.LoadHtml("http://theurl.com");` will not work. The parameter to `LoadHtml` should be a string containing HTML, not a URL, so the parser treats the URL text itself as the document. You must first fetch the HTML document before trying to parse it.
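As a rough sketch of "fetch first, then parse", something along these lines should work. The `PageLoader` class and its method names are made up for illustration, and `HttpClient` is just one way to download the page. HtmlAgilityPack's own `HtmlWeb.Load(url)` is another option.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class PageLoader
{
    // Parses markup that is already in memory; this is what LoadHtml expects.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Downloads the page text first, then hands the markup to the parser.
    public static async Task<HtmlDocument> FetchAsync(string url)
    {
        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(url);
            return Parse(html);
        }
    }
}
```

From an async method you would call it as `HtmlDocument doc = await PageLoader.FetchAsync("http://theurl.com");`, with the placeholder URL replaced by the real page.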

Once you have the document loaded, for this specific example you can use this:

IEnumerable<string> links = doc.DocumentNode
                               .SelectNodes("//a[@class='tim_new']")
                               .Select(n => n.Attributes["href"].Value);
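One thing to watch for: `SelectNodes` returns `null` rather than an empty collection when nothing matches, which is probably what triggers the runtime error in the `foreach` (a URL string parsed as HTML contains no `td` elements). A more defensive variant of the query might look like the sketch below; `LinkExtractor` and `GetLinks` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class LinkExtractor
{
    // Returns the href of every <a class="tim_new">, or an empty list if none match.
    public static List<string> GetLinks(HtmlDocument doc)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='tim_new']");

        // SelectNodes yields null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => n.GetAttributeValue("href", string.Empty)) // empty string if href is missing
            .Where(href => href.Length > 0)
            .ToList();
    }
}
```

With this, `foreach (string link in LinkExtractor.GetLinks(doc))` is safe even when the page has no matching links.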
Mark Byers
Mark, I want to extract specifically the links with `class="tim_new"`. There are many links in that HTML.
Soham
Instead of `href` in the attributes, can I select on `"tim_new"`?
Soham
@Soham: I've updated my answer, but note that your problem is mostly to do with the way that you are loading the document rather than with how you are parsing it.
Mark Byers
Yes Mark, I have taken note of that and changed the code. Thanks for the help. I am not sure whether my query was correct. Can you comment more on that part? Thanks for the piece of code.
Soham
@Soham: Your XPath query was OK. Note that it selects the entire `td` element, not just the link. In the query in my answer I select only the `href` attribute.
Mark Byers
Oh okay, thanks. I needed to extract only the link. Also, how long does `HtmlWeb.Load()` take? Is it slow? The HTML file is around 624 KB when I save it as text.
Soham
@Soham: Actually I didn't know about the `HtmlWeb` class... thanks for teaching me something new. :) I don't know how long it takes. I would imagine that it depends mostly on your connection speed.
Mark Byers
Hmm... Mark, one more thing: does the program close cleanly? That is, does it need to do any housekeeping?
Soham
@Soham: It's hard to say when I can't see your full program, but the code that you've posted looks fine to me. The main thing is to close any open network connections. I think HtmlWeb.Load does this for you.
Mark Byers
Mark, thanks for this. Can you suggest a good reference where I can read up on XPath and DOM nodes? I am finding it difficult to understand the various features of HtmlAgilityPack.
Soham