Hi all,

This may just be my inexperience with XPath, but I've searched enough that I'd like to ask to make sure.

I have a website and want to get the news headlines from it: www.farsnews.com (the site is in Persian).

Using the FireBug and FireXpath extensions for Firefox, and also by hand, I extracted and tested several XPath expressions that match the headlines, such as:

* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]

I also tested these with the XPather extension and they seem to work well there, but when I run them in my code, SelectNodes returns null!

Any clue or hint?

Here is a chunk of the code:

    listBox2.ResetText();

    HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
    HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");

    listBox2.Items.Add(nc.Count + " Items selected!");

    foreach (HtmlAgilityPack.HtmlNode node in nc) {
        listBox2.Items.Add(node.InnerText);
    }

Thanks.

A: 

Hi,

I have tested your expressions. As Dialecticus mentioned in a comment, you have a trailing space in the class value that shouldn't be there.

//div[@class='topnewsinfotitle ']/text()

Returns 'empty sequence', see evaluation: http://xmltools.dk/EQA-ACA6

//div[@class='topnewsinfotitle']/text()

Returns a list of your headlines, see: http://xmltools.dk/EgA2APAj
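
To tie this back to the code in the question, here is a minimal sketch of the same comparison in C# with HtmlAgilityPack. The markup in SampleHtml is made up for illustration (it is not the real page), and the HeadlineDemo class and its Select helper are hypothetical names. It relies on the fact that HtmlAgilityPack's SelectNodes returns null, not an empty collection, when nothing matches, which is why nc.Count throws in the original snippet.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class HeadlineDemo
{
    // Made-up markup standing in for the real page (assumption).
    public const string SampleHtml =
        "<html><body>" +
        "<div class=\"topnewsinfotitle\">First headline</div>" +
        "<div class=\"topnewsinfotitle\">Second headline</div>" +
        "</body></html>";

    // Returns the inner text of every node matched by xpath,
    // or an empty list when SelectNodes finds nothing.
    public static List<string> Select(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        var result = new List<string>();

        // SelectNodes yields null for "no match", so guard before iterating.
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText);
        }
        return result;
    }
}
```

With this sample markup, HeadlineDemo.Select(HeadlineDemo.SampleHtml, "//div[@class='topnewsinfotitle ']") gives an empty list, while the same call without the trailing space gives both headlines.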

However, if the element could carry other classes as well, use this ( http://xmltools.dk/EwA8AJAW ):

//div[contains(@class, 'topnewsinfotitle')]/text()
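
One caveat with contains(): it is a plain substring test, so it would also match a class such as topnewsinfotitle2. If that matters, a common XPath 1.0 idiom is to pad the normalized class value with spaces and look for the padded token. The sketch below compares the two on made-up markup; ClassMatchDemo, SampleXml and the two constants are hypothetical names, and it uses System.Xml from the base class library only so that it runs without extra packages. The same expression strings should work with HtmlAgilityPack's SelectNodes as well, since both take XPath 1.0.

```csharp
using System.Xml;

public static class ClassMatchDemo
{
    // Made-up, well-formed markup for illustration (assumption).
    public const string SampleXml =
        "<root>" +
        "<div class=\"topnewsinfotitle\">A</div>" +
        "<div class=\"box topnewsinfotitle\">B</div>" +
        "<div class=\"topnewsinfotitle2\">C</div>" +
        "</root>";

    // Substring test: matches any class value containing the text.
    public const string SubstringXPath =
        "//div[contains(@class, 'topnewsinfotitle')]";

    // Token test: matches only when the class list has that exact token.
    public const string TokenXPath =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' topnewsinfotitle ')]";

    // Counts the nodes in SampleXml selected by the given expression.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectNodes(xpath)?.Count ?? 0;
    }
}
```

On this sample, the substring expression selects all three div elements, while the token expression selects only the first two.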

(I see there is an encoding issue in the links I've provided; however, it shouldn't affect the meaning. For all of the XPath expressions, you can remove /text() to get the nodes instead of only the text.)

BUT, if you own this site, you should provide the headlines as XML (maybe RSS or Atom) or JSON, which will perform better and, most importantly, be more robust than scraping the HTML.

lasseespeholt