While parsing HTML with Yahoo Query Language (YQL) and the XPath support it provides, I ran into a problem: I could not extract text() or attribute values.
For example,

select * from html where url="http://stackoverflow.com" 
and xpath='//div/h3/a'

returns a list of anchors as XML:

<results>
    <a class="question-hyperlink" href="/questions/661184/filling-the-text-area-with-the-text-when-a-button-is-clicked" title="In ASP.net, I need the code to fill the text area (in the form) when a button is clicked. Can you help me through by showing a simple .aspx code containing the script tag? ">Filling the text area with the text when a button is clicked</a>...
</results>

Now when I try to extract the node value using

select * from html where url="http://stackoverflow.com" 
and xpath='//div/h3/a/text()'

I get the results concatenated rather than as a node list, e.g.

<results>Xcode: attaching to a remote process for debuggingWhy is b
…… </results>

How do I separate the results into a node list, and how do I select attribute values?

A query like this

select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a[@href]'

gave me the same results as querying //div/h3/a.

A: 

You are using YQL? Are you trying to get an element on the page or...? I strongly suggest that you switch to jQuery. With jQuery's selectors you will be able to get any element off of the page. In addition, jQuery has excellent documentation and community support.

YQL gives access to a lot of additional functionality from Yahoo that I need to leverage.
Cherian
+8  A: 

YQL requires the XPath expression to evaluate to an itemPath rather than node text. Once you have an itemPath, you can project various values from the tree.

In other words, an itemPath should point to a node in the resulting HTML rather than to text content or attributes. YQL returns all matching nodes and their children when you select * from the data.
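The difference between selecting nodes and selecting values can be seen outside YQL as well. The following is only a rough local sketch in Python using the standard library's xml.etree.ElementTree, not YQL itself; the PAGE markup is a made-up stand-in for the real page. It also suggests why adding [@href] changed nothing in the question: that bracket is a predicate that filters nodes, not a way to pick the attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in markup, not the real stackoverflow.com page.
PAGE = """
<html><body>
  <div><h3><a href="/questions/1/first">First question</a></h3></div>
  <div><h3><a href="/questions/2/second">Second question</a></h3></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# A path that ends at an element yields a list of nodes.
anchors = root.findall(".//div/h3/a")

# [@href] is a predicate: it keeps only anchors that have an href.
# It does not select the attribute value itself.
anchors_with_href = root.findall(".//div/h3/a[@href]")

print(len(anchors), len(anchors_with_href))
print(anchors == anchors_with_href)
```

Since every anchor in the sample has an href, both paths match the same nodes, which is consistent with the identical results reported in the question.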

Example:

select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a'

This returns all the a elements matching the XPath. To project the text content, use

select content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'

"content" returns the text content held within the node.

To project attributes, name them relative to the node that the XPath expression selects. In this case you need href, which is an attribute of a:

select href from html where url="http://stackoverflow.com" and xpath='//div/h3/a'

This returns

<results>
    <a href="/questions/663973/putting-a-background-pictures-with-leds"/>
    <a href="/questions/663013/advantages-and-disadvantages-of-popular-high-level-languages"/>
    ....
</results>

If you needed both the attribute 'href' and the textContent, then you can execute the following YQL query:

select href, content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'

returns:

<results>
    <a href="/questions/663950/double-pointer-const-issue-issue">double pointer const issue issue</a>...
</results>
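To make the projection idea concrete, here is a minimal sketch in Python of what "select href, content" amounts to once the nodes are in hand. It is an illustration under assumptions, not how YQL is implemented: the RESULTS document and the project helper are hypothetical, and "content" is simply borrowed from YQL as the name for a node's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical result document shaped like the YQL output above.
RESULTS = """
<results>
  <a href="/questions/1/first">First question</a>
  <a href="/questions/2/second">Second question</a>
</results>
"""

def project(xml_text, fields):
    """Return one dict per <a> node, holding only the requested fields."""
    rows = []
    for node in ET.fromstring(xml_text).findall("a"):
        row = {}
        for field in fields:
            # "content" stands for the node's text; any other
            # field name is looked up as an attribute on the node.
            row[field] = node.text if field == "content" else node.get(field)
        rows.append(row)
    return rows

rows = project(RESULTS, ["href", "content"])
for row in rows:
    print(row["href"], "->", row["content"])
```

Each anchor stays a separate entry, so the text values are not run together the way they were with the text() query.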

Hope that helps. Let me know if you have more questions on YQL.

Nagesh Susarla
Works like a charm!
Cherian
A: 

Nagesh: Thanks so much for that! I didn't know you could narrow the select down to href and content. I was breaking my head all day trying to find how to do that. Awesome! Thanks so much!

Add this as a comment, not as an answer.
Cherian