I have been trying to extract links from elements with a class called "tim_new", and I have been given a solution as well.

The solution, the snippet, and the other necessary information are given here.

The XPath query was "//a[@class='tim_new']". My question is: how did this query differentiate between the first line of the snippet (given in the link above) and the second line of the snippet?

More specifically, what is the literal translation (in English) of this XPath query?


Furthermore, I want to write a few lines of code to extract the text written against NSE:

<div class="FL gL_12 PL10 PT15">BSE: 523395 &nbsp;&nbsp;|&nbsp;&nbsp; NSE: 3MINDIA &nbsp;&nbsp;|&nbsp;&nbsp; ISIN: INE470A01017</div>

Would appreciate help in forming the necessary selection query.

My code is written as:

IEnumerable<string> NSECODE = doc.DocumentNode.SelectSingleNode("//div[@NSE:]");

But this doesn't look right. Would appreciate some help.

+1  A: 

The XPath in the first selection reads "select all a elements, anywhere in the document, that have an attribute named class with a value of tim_new". The stuff in brackets is not what you're returning; it's the criterion you're applying to the search.
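To make the "criterion, not return value" point concrete, here is a minimal sketch that uses the built-in System.Xml classes rather than the HTML Agility Pack. The markup and the href values are made up for illustration; the same XPath strings should behave similarly in other XPath 1.0 engines, but that is an assumption worth checking against the library you actually use.

```csharp
using System;
using System.Xml;

// Hypothetical markup: two links carry class="tim_new", one does not.
var doc = new XmlDocument();
doc.LoadXml(
    "<root>" +
    "<a class='tim_new' href='/first'>First</a>" +
    "<a class='other' href='/skip'>Skipped</a>" +
    "<a class='tim_new' href='/second'>Second</a>" +
    "</root>");

// The predicate in brackets only filters; the <a> elements are what come back.
XmlNodeList matches = doc.SelectNodes("//a[@class='tim_new']");
foreach (XmlNode a in matches)
{
    Console.WriteLine(a.Attributes["href"].Value);
}

// Wrapping the path in parentheses and adding [1] keeps only the
// first match in document order.
XmlNode first = doc.SelectSingleNode("(//a[@class='tim_new'])[1]");
Console.WriteLine(first.Attributes["href"].Value);
```

The link with class="other" is skipped because it fails the predicate, not because of where it sits in the snippet.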

I don't have the HTML Agility Pack, but if you are trying to query the divs that have "NSE:" in their text, your XPath for the second query should just select the divs (for example "//div"), and then you'll want to filter using LINQ.

Something like

var nodes = 
    doc.DocumentNode.SelectNodes("//div[text()]").Where(a => a.InnerText.IndexOf("NSE:") > -1);

So in English, "Return all the div elements that immediately contain text to LINQ, then check that the inner text value contains NSE:". Again, I'm not sure the syntax is perfect, but that's the idea.

The XPath "//div[@NSE:]" would try to match all divs that have an attribute named NSE:, which is not a valid query anyway, because ":" is reserved for separating a namespace prefix from a name and cannot end a name. You're looking for the text of the element, not one of its attributes.

Hope that helps.

Note: If you have nested divs that both contain text, as in <div>NSE: some text<div>NSE: more text</div></div>, you're going to get duplicate results.
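Once the right div has been found, the NSE code still has to be pulled out of its text. The following is only a sketch of one way to do that with plain string handling. `ExtractField` is a hypothetical helper, not part of any library, and the input string is the inner markup of the div from the question with its entities still encoded.

```csharp
using System;
using System.Net;

// Inner markup of the div from the question, entities still encoded.
string raw = "BSE: 523395 &nbsp;&nbsp;|&nbsp;&nbsp; NSE: 3MINDIA &nbsp;&nbsp;|&nbsp;&nbsp; ISIN: INE470A01017";

// Hypothetical helper: returns the value that follows "label:" in a
// pipe-separated string, or an empty string when the label is absent.
string ExtractField(string text, string label)
{
    // Turn &nbsp; into real non-breaking spaces so Trim() can remove them.
    string decoded = WebUtility.HtmlDecode(text);
    foreach (string part in decoded.Split('|'))
    {
        string field = part.Trim();
        if (field.StartsWith(label + ":", StringComparison.Ordinal))
        {
            return field.Substring(label.Length + 1).Trim();
        }
    }
    return string.Empty;
}

string nseCode = ExtractField(raw, "NSE");
Console.WriteLine(nseCode); // 3MINDIA
```

If the library already hands back decoded text, the HtmlDecode call is harmless; it is there only because the sample string above is still encoded.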

Laramie
Soham
@Soham - If I understand you correctly, in order to select just the first element with class="tim_new", your XPath should be //a[@class='tim_new'][1]. The [1] keeps only the first match of the preceding step (strictly, the first such a within each parent element; (//a[@class='tim_new'])[1] gives the first match in the whole document). In XML, the &nbsp; would be parsed as text, as you assumed.
Laramie
Is `//a[@class='tim_new']` in any way equivalent to `//a[@class='tim_new'][1]`? I.e., when the index is not given, is the second match ignored? Laramie, additionally, when I use `var NSECode = doc.DocumentNode.SelectNodes("//div[text()]").Where(a => a.InnerText.IndexOf("NSE:") > -1); Console.WriteLine(NSECode.ToString());` it returns an error that reads: System.Linq.Enumerable+<WhereIterator>d__0`1[HtmlAgilityPack.HtmlNode]
Soham
//a[@class='tim_new'] and //a[@class='tim_new'][1] are not equivalent. In most XPATH engines, the first one returns all of the <a> tags where class="tim_new". The second one just returns the first match. The question is, what does the HTML Agility pack do with the results. As I mentioned, I'm not familiar with HTML Agility's implementation of XPATH so the results could be different. That is probably why you are getting the exception to the LINQ query. I'm not sure what the property is to get the innerText of the node. I was guessing that it was InnerText, but you'll have to experiment.
Laramie
Thanks Laramie, but the C# statement you suggested compiles fully. It's just that it breaks down at run time. Can you suggest how I can really learn more about LINQ and XPath?
Soham
The W3C has an XPath tutorial, and this site is a good resource for learning the syntax of LINQ by example (http://msdn.microsoft.com/en-us/vcsharp/aa336746.aspx). Microsoft also has webcasts on their events page. You don't actually need LINQ to solve your problem. You can get the same effect by looping through the results of the XPath I gave you and filtering out the data you need manually. Good luck.
Laramie
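As a rough illustration of the last two comments, the sketch below contrasts the LINQ filter with a plain loop, using a list of strings as a stand-in for the InnerText of each div (the sample data is hypothetical). It also shows that calling ToString() on a LINQ query prints the name of the iterator type rather than the matches, which looks very much like the text quoted in the comment above; enumerating the query is what produces the actual items.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-ins for the InnerText of each div; hypothetical sample data.
var innerTexts = new List<string>
{
    "BSE: 523395 | NSE: 3MINDIA | ISIN: INE470A01017",
    "Unrelated text"
};

// LINQ version: the query is a lazy sequence, so ToString() gives
// the iterator's type name, not the matching items.
IEnumerable<string> query =
    innerTexts.Where(text => text.IndexOf("NSE:", StringComparison.Ordinal) > -1);
Console.WriteLine(query.ToString());

// Enumerate the query to see the matches themselves.
foreach (string match in query)
{
    Console.WriteLine(match);
}

// Loop version without LINQ: same filter, written by hand.
var manual = new List<string>();
foreach (string candidate in innerTexts)
{
    if (candidate.Contains("NSE:"))
    {
        manual.Add(candidate);
    }
}
Console.WriteLine(manual.Count);
```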