ansaurus

Question

XPATH query, HtmlAgilityPack and Extracting Text

Answer 1

+1 A:

The XPath in the first selection reads "select all document elements that have an attribute named class with a value of tim_new". The stuff in brackets is not what you're returning, it's the criteria you're applying to the search.

I don't have the HTML Agility Pack installed, but if you are trying to find the divs whose text contains "NSE:", your XPath for the second query should just select the divs ("//div"), and then you'll want to filter the results using LINQ.

Something like

// Requires "using System.Linq;". SelectNodes returns null when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//div[text()]")
    .Where(a => a.InnerText.Contains("NSE:"));

So in English: "Select all the div elements that directly contain text, hand them to LINQ, then keep only those whose inner text contains NSE:". Again, I'm not sure the syntax is perfect, but that's the idea.

The XPath "//div[@NSE:]" would try to return all divs that have an attribute named "NSE:", which would be illegal anyway because a name cannot end with ":" (the colon is reserved for namespace prefixes). You're looking for the text of the element, not one of its attributes.

Hope that helps.

Note: If you have nested divs that both contain text, as in <div>NSE: some text<div>NSE: more text</div></div>, you're going to get duplicate results, because the inner text of the outer div includes the text of the inner div.

Laramie 2010-06-06 17:55:51
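
The nested-div caveat in the note can be reproduced with a small program. This is only a sketch under assumptions: the markup is made up (the real page is not shown in the question), HtmlAgilityPack is assumed to be installed from NuGet, and the second XPath, `//div[text()[contains(., 'NSE:')]]`, is a swapped-in alternative rather than something the answer proposes. It tests only the text nodes that are direct children of each div.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed installed via NuGet

class NestedDivDemo
{
    static void Main()
    {
        // Made-up markup that reproduces the nested case from the note.
        var html = "<div>NSE: some text<div>NSE: more text</div></div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText of the outer div includes the inner div's text,
        // so both divs pass this filter and the inner text shows up twice.
        var all = doc.DocumentNode.SelectNodes("//div[text()]");
        if (all != null)
        {
            foreach (var d in all.Where(d => d.InnerText.Contains("NSE:")))
            {
                Console.WriteLine("InnerText: " + d.InnerText);
            }
        }

        // Alternative: match on each div's direct text children only.
        var own = doc.DocumentNode.SelectNodes("//div[text()[contains(., 'NSE:')]]");
        if (own != null)
        {
            foreach (var d in own)
            {
                // Join only this div's own text nodes, skipping nested elements.
                var ownText = string.Concat(d.ChildNodes
                    .Where(n => n.NodeType == HtmlNodeType.Text)
                    .Select(n => n.InnerText));
                Console.WriteLine("Own text: " + ownText.Trim());
            }
        }
    }
}
```

Both divs still match in the second query, since each has its own "NSE:" text, but each one now contributes only its own text.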

Soham 2010-06-07 05:19:41

@Soham - If I understand you correctly, in order to select just the first element with class="tim_new" your XPATH should be //a[@class='tim_new'][1]. The [1] returns only the first match of the previous statement. In XML, the would be parsed as text as you assumed.

Laramie 2010-06-07 07:57:42
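
One subtlety worth checking when using `[1]`: in XPath 1.0 a positional predicate on `//a[...]` is applied per parent, so `//a[@class='tim_new'][1]` can return more than one node when matching links sit under different parents. Wrapping the path in parentheses applies the index to the whole result. The sketch below uses made-up markup and assumes HtmlAgilityPack is installed; `CountOf` is a hypothetical helper added for the example.

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed installed via NuGet

class FirstMatchDemo
{
    // Hypothetical helper: SelectNodes returns null when nothing matches.
    static int CountOf(HtmlNodeCollection nodes)
    {
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Made-up markup: matching links under two different parents.
        var html = "<div><a class='tim_new'>one</a><a class='tim_new'>two</a></div>"
                 + "<div><a class='tim_new'>three</a></div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var root = doc.DocumentNode;

        // Every <a> whose class attribute equals tim_new.
        Console.WriteLine(CountOf(root.SelectNodes("//a[@class='tim_new']")));

        // [1] is evaluated per parent: the first matching <a> under each parent.
        Console.WriteLine(CountOf(root.SelectNodes("//a[@class='tim_new'][1]")));

        // Parentheses apply [1] to the whole node set: the first match overall.
        Console.WriteLine(CountOf(root.SelectNodes("(//a[@class='tim_new'])[1]")));

        // SelectSingleNode is another way to ask for just the first match.
        var single = root.SelectSingleNode("//a[@class='tim_new']");
        if (single != null)
        {
            Console.WriteLine(single.InnerText);
        }
    }
}
```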

Is `//a[@class='tim_new']` in any way equivalent to `//a[@class='tim_new'][1]`? I.e., when the index is not given, is the second match ignored? Laramie, additionally, when I use `var NSECode = doc.DocumentNode.SelectNodes("//div[text()]").Where(a => a.InnerText.IndexOf("NSE:") > -1); Console.WriteLine(NSECode.ToString());` it returns an error which goes like this: system.linq.enumerable+<WhereIterator>d__0'1[HtmlAgilityPack.HtmlNode]

Soham 2010-06-07 08:25:08

//a[@class='tim_new'] and //a[@class='tim_new'][1] are not equivalent. In most XPATH engines, the first one returns all of the <a> tags where class="tim_new". The second one just returns the first match. The question is what the HTML Agility Pack does with the results. As I mentioned, I'm not familiar with HTML Agility's implementation of XPATH, so the results could be different. That is probably why you are getting the exception from the LINQ query. I'm not sure what the property is to get the inner text of the node. I was guessing that it was InnerText, but you'll have to experiment.

Laramie 2010-06-07 16:50:05
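
The text quoted in the earlier comment looks like the default ToString() output of a LINQ query object (its type name) rather than an exception message: `Where` returns a lazy sequence, and calling ToString() on it does not print the matched nodes. A minimal sketch of enumerating the results instead, assuming made-up markup and HtmlAgilityPack installed from NuGet:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed installed via NuGet

class EnumerateDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>NSE: ABC</div><div>other</div>"); // made-up markup

        // Non-null here because the sample markup contains divs with text;
        // on real pages, check for null before calling Where.
        var nseCode = doc.DocumentNode.SelectNodes("//div[text()]")
            .Where(a => a.InnerText.IndexOf("NSE:") > -1);

        // Prints the type name of the lazy query object, not the matched text.
        Console.WriteLine(nseCode.ToString());

        // Enumerating the query yields the actual nodes.
        foreach (var node in nseCode)
        {
            Console.WriteLine(node.InnerText);
        }

        // Or take just the first match (null if there is none).
        var first = nseCode.FirstOrDefault();
        if (first != null)
        {
            Console.WriteLine(first.InnerText);
        }
    }
}
```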

Thanks Laramie, but the C# statement which you suggested compiles fully. It's just that it breaks down at run time. Can you suggest how I can really learn more about LINQ and XPATH?

Soham 2010-06-07 16:55:23

The W3C has an XPath tutorial and this site is a good resource for learning the syntax of LINQ by example (http://msdn.microsoft.com/en-us/vcsharp/aa336746.aspx). Microsoft also has webcasts on their events page. You don't actually need LINQ to solve your problem. You can get the same effect by looping through the results of the XPATH I gave you and filtering out the data you need manually. Good luck.

Laramie 2010-06-08 01:25:51
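
The LINQ-free approach suggested in the last comment could look roughly like the sketch below. The markup is made up and HtmlAgilityPack is assumed to be installed from NuGet; the loop simply applies the same IndexOf test by hand.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed installed via NuGet

class LoopDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>NSE: ABC</div><div>BSE: 123</div>"); // made-up markup

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[text()]");
        if (divs != null)
        {
            foreach (HtmlNode div in divs)
            {
                // Keep only the divs whose text contains the marker.
                if (div.InnerText.IndexOf("NSE:") > -1)
                {
                    results.Add(div.InnerText.Trim());
                }
            }
        }

        foreach (var text in results)
        {
            Console.WriteLine(text);
        }
    }
}
```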
