Hi,

I am using the Html Agility Pack to do some screen scraping, and my code so far to get the titles is:

foreach (HtmlNode title in root.SelectNodes("//html//body//div//div//div[3]//div//div//div//div[3]//ul//li[1]//h4"))
        {
            string titleString = "<div class=\"show\">" + title.InnerText + "</div>";
            shows.Add(titleString);
        }

Before each title I want to show the timestamp that belongs to it, which is at this node:

/html/body/div/div/div[3]/div/div/div/div[3]/ul/li[1]/ul/li/span

How can I get this value next to the title? So something like:

string titleString = "<div class=\"show\">" + time.InnerText + " - " + title.InnerText + "</div>";
A: 

Hi Morgan!

Try getting the parent node first, then select both the title and the timestamp relative to that parent:

        HtmlNodeCollection TvGuideCollection = doc.DocumentNode.SelectNodes(@"//ul[@class='results']//ul//li");
        List<string> shows = new List<string>();
        foreach (HtmlNode item in TvGuideCollection)
        {
            HtmlNode title = item.SelectSingleNode(".//a");
            HtmlNode time = item.SelectSingleNode(".//span[@class='stamp']");
            if (title != null && time != null)
            {
                string titleString = "<div class=\"show\">" + time.InnerText + " - " + title.InnerText + "</div>";
                shows.Add(titleString);
            }
        }
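
The leading `.` in `.//a` and `.//span[@class='stamp']` matters here. An XPath expression that starts with `//` searches from the document root even when it is called on a child node, while `.//` searches only below the current node. A minimal sketch on made-up markup (the HTML string and the class name `RelativeXPathDemo` are assumptions for illustration, not the real page):

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical markup, invented for this example only.
    public const string SampleHtml =
        "<ul><li id='first'><a>News</a></li><li id='second'><a>Weather</a></li></ul>";

    // Runs the given XPath from the second <li> and returns the matched link text.
    public static string LinkTextFromSecondItem(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        HtmlNode second = doc.DocumentNode.SelectSingleNode("//li[@id='second']");
        return second.SelectSingleNode(xpath).InnerText;
    }

    public static void Main()
    {
        // "//a" ignores the context node and returns the first <a> in the document.
        Console.WriteLine(LinkTextFromSecondItem("//a"));   // News
        // ".//a" stays inside the second <li>.
        Console.WriteLine(LinkTextFromSecondItem(".//a"));  // Weather
    }
}
```

Without the dot, every iteration of the loop would pick up the same first title and timestamp in the document.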

Updated to get just today's shows:

            HtmlNode TvGuideToday = doc.DocumentNode.SelectSingleNode(@"//ul[@class='results']//ul");
            List<string> shows = new List<string>();
            foreach (HtmlNode item in TvGuideToday.SelectNodes(".//li")) 
            {
                HtmlNode title = item.SelectSingleNode(".//a");
                HtmlNode time = item.SelectSingleNode(".//span[@class='stamp']");
                if (title != null && time != null)
                {
                    string titleString = "<div class=\"show\">" + time.InnerText + " - " + title.InnerText + "</div>";
                    shows.Add(titleString);
                }
            }
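
One thing to watch for: `SelectSingleNode` and `SelectNodes` return `null` when nothing matches, so the snippet above would throw a `NullReferenceException` if the page layout changes. Below is a rough, self-contained sketch of the same idea with null guards. The sample markup, the class name `TodayShows` and the method name `Extract` are assumptions made up for illustration; the real page structure may differ.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TodayShows
{
    // Hypothetical markup, invented for this sketch; the real page may differ.
    public const string SampleHtml =
        "<ul class='results'>" +
          "<li><h4>Today</h4><ul>" +
            "<li><span class='stamp'>6:00pm</span> <a>News</a></li>" +
            "<li><a>No timestamp here</a></li>" +
            "<li><span class='stamp'>7:00pm</span> <a>Weather</a></li>" +
          "</ul></li>" +
          "<li><h4>Tomorrow</h4><ul>" +
            "<li><span class='stamp'>6:00pm</span> <a>Sport</a></li>" +
          "</ul></li>" +
        "</ul>";

    public static List<string> Extract(string html)
    {
        var shows = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First nested <ul> under ul.results, i.e. the first day's list.
        HtmlNode today = doc.DocumentNode.SelectSingleNode("//ul[@class='results']//ul");
        if (today == null) return shows;   // layout not found: return an empty list

        HtmlNodeCollection items = today.SelectNodes(".//li");
        if (items == null) return shows;   // the list has no entries

        foreach (HtmlNode item in items)
        {
            HtmlNode title = item.SelectSingleNode(".//a");
            HtmlNode time = item.SelectSingleNode(".//span[@class='stamp']");
            if (title != null && time != null)   // skip entries missing either part
            {
                shows.Add("<div class=\"show\">" + time.InnerText + " - " + title.InnerText + "</div>");
            }
        }
        return shows;
    }

    public static void Main()
    {
        foreach (string show in Extract(SampleHtml))
        {
            Console.WriteLine(show);
        }
    }
}
```

On the sample markup this prints the two entries from the first list that have both a timestamp and a title, and nothing from the second day's list.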
Ole Melhus
Hi Ole, I tried that, but it only gets the first node and then throws an error: "SEHException was unhandled".
Morgan
Have you tried selecting the <li> by id or class name instead? Could you post some of the source HTML or a URL for the site you are scraping?
Ole Melhus
I am trying to get time and title from http://au.tv.yahoo.com/tv-guide/channel/18891/
Morgan
I've updated the code. Try it. :-)
Ole Melhus
Thanks Ole, it works! One thing though: I only want to get the current day. In my code above I had li[1] before the h4 tag, but I can't seem to get that right now. Any ideas?
Morgan
Updated the code again. It should now give you only today's shows.
Ole Melhus
Cool, thanks Ole.
Morgan