I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structure (so a Yahoo Pipes "Fetch Page" with its "Cut content" feature does not work well) but the fragment I am interested in always has the same class attribute.

If I have an HTML page like this:

<html>
  <body>
    <div class="foo">
      <p>Wolf</p>
      <ul>
        <li>Dog</li>
        <li>Cat</li>
      </ul>
    </div>
  </body>
</html>

and use a YQL expression like this:

SELECT * FROM html 
WHERE url="http://example.com/containing-the-fragment-above" 
AND xpath="//div[@class='foo']"

what I get back are the (apparently unordered?) DOM elements, whereas what I want is the HTML content itself. I've tried SELECT content as well, but that only selects the textual content. I want HTML. Is this possible?

A: 

YQL converts the page into XML, then does your XPath on it, then takes the DOMNodeList and serializes that back to XML for your output (and then converts to JSON if needed). You can't access the original data.

Why can't you deal with XML instead of HTML?

Paul Tarjan
I'm using this in the context of Yahoo Pipes, so I want to insert the HTML into an RSS feed to be rendered by a feed reader/browser. Inserting the XML might work, but the Pipes YQL module seems to just insert the DOM elements into the document; I don't see a way to get the XML source either.
Joe Shaw
+1  A: 

I had exactly the same problem. The only way I have gotten around it is to avoid YQL and just use regular expressions to match the start and end tags :/. Not the best solution, but if the HTML is relatively unchanging, and the pattern runs from, say, <div class='name'> to <div class='just_after'>, then you can get away with that and take the HTML in between.
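A rough sketch of that idea in JavaScript, in case it helps. The function names and the sample page are made up for illustration, and the two markers are assumed to be literal strings that appear exactly once each, in that order, in the page source:

```javascript
// Escape a literal string so it can be embedded safely in a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the HTML between the first startMarker and the next endMarker,
// or null if the pair is not found. Both markers are hypothetical literal
// strings picked by looking at the target page's source.
function extractBetween(html, startMarker, endMarker) {
  const pattern = new RegExp(
    escapeRegExp(startMarker) + "([\\s\\S]*?)" + escapeRegExp(endMarker)
  );
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

// Made-up sample page, only for demonstration.
const page =
  "<div class='name'><p>Wolf</p><ul><li>Dog</li></ul></div>" +
  "<div class='just_after'>footer</div>";

const fragment = extractBetween(
  page,
  "<div class='name'>",
  "<div class='just_after'>"
);
console.log(fragment);
```

Note that the captured text still includes the closing </div> of the first block, since the match simply runs up to the next marker. That is part of why this only works when the markup around the fragment stays put.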

viatropos
Yeah, this is what I ended up doing too. Unfortunately the structure of the page changes depending on what type of entry it is, so I end up having to split the feed several times to handle all the different types and merge/sort them back together. A real pain, but it works.
Joe Shaw
+3  A: 

You could write a little Open Data Table to send out a normal YQL html table query and stringify the result. Something like the following:

<?xml version="1.0" encoding="UTF-8" ?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <sampleQuery>select * from {table} where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'</sampleQuery>
    <description>Retrieve HTML document fragments</description>
    <author>Peter Cowburn</author>
  </meta>
  <bindings>
    <select itemPath="result.html" produces="JSON">
      <inputs>
        <key id="url" type="xs:string" paramType="variable" required="true"/>
        <key id="xpath" type="xs:string" paramType="variable" required="true"/>
      </inputs>
      <execute><![CDATA[
var results = y.query("select * from html where url=@url and xpath=@xpath", {url:url, xpath:xpath}).results.*;
var html_strings = [];
for each (var item in results) html_strings.push(item.toXMLString());
response.object = {html: html_strings};
]]></execute>
    </select>
  </bindings>
</table>

You could then query against that custom table with a YQL query like:

use "http://url.to/your/datatable.xml" as html.tostring;
select * from html.tostring where 
  url="http://finance.yahoo.com/q?s=yhoo" 
  and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li'

Edit: Just realised this is a pretty old question that was bumped; at least an answer is here, eventually, for anyone stumbling on the question. :)

salathe
Beautiful! Thank you. The only issue I have now is how to get a Yahoo Pipes variable into the YQL expression. For example, select * from html.tostring where url=item.link and xpath='//div[@id="foo"]' gives back the error "Invalid identfier item.link. me is the only supported identifier in this context." Any ideas how I do that? (Sorry for the butchered code snippet, looks like comments don't allow much in the way of formatting.)
Joe Shaw
Figured out the answer to this: create a separate pipe which takes a URL input, inserts that into a string builder which builds the YQL query, and attach that as the query to the YQL widget. Then in your main pipe, use this new pipe and pass in the URL as the input to it. I think I'll probably open a new question for this specifically so people don't have to hunt it down in the comments of this one.
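For anyone who would rather see it as code than as Pipes widgets, the string-builder step amounts to roughly the following JavaScript. This is only a sketch: buildYqlQuery is a made-up name, the example.com URL is a placeholder, and since I am not sure of YQL's rules for escaping quotes inside string literals, the function simply rejects values that contain the quote character used to delimit them.

```javascript
// Build the query text that the sub-pipe hands to the YQL widget.
// tableUrl: location of the Open Data Table XML
// pageUrl:  the per-item URL (item.link in the main pipe)
// xpath:    XPath selecting the fragment to stringify
function buildYqlQuery(tableUrl, pageUrl, xpath) {
  // Assumption: no attempt is made to escape quotes; such input is refused.
  if (tableUrl.includes('"') || pageUrl.includes('"')) {
    throw new Error("URLs must not contain double quotes");
  }
  if (xpath.includes("'")) {
    throw new Error("XPath must not contain single quotes");
  }
  return (
    'use "' + tableUrl + '" as html.tostring; ' +
    'select * from html.tostring where url="' + pageUrl + '" ' +
    "and xpath='" + xpath + "'"
  );
}

const query = buildYqlQuery(
  "http://url.to/your/datatable.xml",
  "http://example.com/entry", // placeholder for item.link
  '//div[@id="foo"]'
);
console.log(query);
```

The point is just that the item's URL ends up as a quoted literal inside the query text, instead of being referenced as item.link from within YQL.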
Joe Shaw
Opened: http://stackoverflow.com/questions/2889406/how-do-i-pass-a-yahoo-pipes-item-into-a-yql-query
Joe Shaw