tags:

views:

27

answers:

1

I am using YQL for some screen scraping, and any quote-like characters are not being returned properly.

For example, markup on the page being scraped is:

There should not be a “split between what we think and what we do,”  

This is returned by YQL as:

There should not be a �split between what we think and what we do,� 

This also happens with ticks and back-ticks.

My JS is like:

var qurlString = '&url=' + encodeURIComponent(url);
$.ajax({
  type: "POST",
  url: "/k_sys/qurl.php",
  datatype: "xml",
  data: qurlString,
  success: function(data) {
    //do something
  }
});

And my qurl.php is like:

  $BASE_URL = "http://query.yahooapis.com/v1/public/yql";
  $url = my scraped site url;
  $yql_query = "select * from html where url='$url'";
  $yql_query_url = $BASE_URL . "?q=" . urlencode($yql_query) . "&format=xml";
  $session = curl_init($yql_query_url);
  curl_setopt($session, CURLOPT_RETURNTRANSFER,true);
  $xml = curl_exec($session);
  echo $xml;

Is this a cURL issue or a YQL issue, and what to I need to do to fix it?

Thanks!

A: 

This sounds like a character encoding issue. The site you are scraping may be setting the character set using a meta tag in the head element instead of configuring the server to properly identify the character encoding in the http header. Find out the character encoding used by the site (you should be able to find this in your browser's view menu) and add the charset key to your YQL query.

Example from the YQL guide: select * from html where url='http://example.com' and charset='iso- 8559-1'

robartsd