Hi, I am having a problem scraping code that I need in order to extract information for a web mashup I'm creating.

Basically, I am trying to scrape code from:

http://yellowpages.com.mt/Meranti-Ltd-In-Malta-Gozo;/Hair-Accessories;Hijjhkikke=Hiojhhfokje.aspx

This is just one of the pages I will need to scrape, and hence I cannot feed the program the code I need directly =/.

When I scrape the page using the following code (in Hpricot):

puts open(ypUrl, 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }

I notice that instead of the part of the code I need, I am only seeing the script reference, namely:

<script type="text/javascript" src="http://maps.google.com/maps?file=api&amp;v=2&amp;sensor=false&amp;key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ"></script><title>

Beautimport Ltd (Balmain Hair Extensions) in Malta | Yellow Pages?? (Malta) Ltd | YellowPages.com.mt

This is also what I see when I view source in Firefox. However, when I hover over the elements in Firebug, I am able to get an XPath, which unfortunately does not work because the script reference stays as it is (I'm not sure I'm explaining this correctly). I really need all the code that the script generates on the page (which so far is only viewable in Firebug). I need it so that I can extract the following (taken from Firebug by hovering over the Google icon on the map):

<a title="Click to see this area on Google Maps" href="http://maps.google.com/maps?ll=35.88805,14.46627&amp;spn=0.006988,0.015922&amp;z=16&amp;key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ&amp;sensor=false&amp;mapclient=jsapi&amp;oi=map_misc&amp;ct=api_logo" target="_blank">

which gives the following XPath (// denotes a tbody), but as I mentioned, since Hpricot is not giving me the entire code, it's pretty useless as it can't get to it!

/html/body/form/table//tr/td/div/table[2]//tr[2]/td[2]/div/div[2]/table//tr/td/div/div[2]/a

In this manner I would be able to extract the Lng and Lat, which I really need for my project. I really don't know how else to go about this using Hpricot, as it's not giving me all the code I need. Any help would be extremely appreciated.
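
To illustrate the extraction step: if that anchor's href were available, the coordinates could probably be read from its ll parameter along these lines. This is only a sketch; extract_lat_lng is a made-up helper name, the href is shortened from the one above, and it assumes the link always carries ll=lat,lng.

```ruby
require 'uri'

# Hypothetical helper: returns [lat, lng] as floats, or nil when the
# href has no "ll" query parameter.
def extract_lat_lng(href)
  # Firebug shows the attribute with HTML-escaped ampersands.
  url = href.gsub('&amp;', '&')
  query = URI.parse(url).query
  return nil if query.nil?

  # Turn "ll=..&spn=..&z=.." into a hash of parameter name => value.
  ll = URI.decode_www_form(query).to_h['ll']
  return nil if ll.nil?

  ll.split(',').map { |value| Float(value) }
end

# Shortened version of the href quoted above (key and other parameters dropped).
href = 'http://maps.google.com/maps?ll=35.88805,14.46627&amp;spn=0.006988,0.015922&amp;z=16'
lat, lng = extract_lat_lng(href)
puts "lat=#{lat} lng=#{lng}"
```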

+1  A: 

This type of screen scraping won't work because you're trying to grab elements that are added to the page dynamically after the page's HTML has been sent to the browser. In this case, the "browser" is Hpricot, and all it sees is the content as sent from the server, rather than the content after the page's JavaScript has been run.

The reason that Firebug can see the elements you're trying to grab is that Firebug analyzes the current state of a page in the browser, which includes the dynamic scripty goodness from Google Maps.

Tim S. Van Haren
+4  A: 

This was a fun one. It can be done, but it's going to take more than Hpricot. I noticed while sniffing the traffic that a web service is called to populate the latitude and longitude. Here's what you can do to get to that information:

Scrape the site as you normally do, but look for a call to the LoadMapByDetail JavaScript function. The line will look something like:

<script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>

Parse the id out and call the webservice. This will end up looking something like:

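require 'rubygems'
require 'hpricot'
require 'open-uri'
# soap4r: bundled with Ruby 1.8; later Rubies need it installed separately
require 'soap/wsdlDriver'

# WSDL describing the site's map web service
WSDL_URL = "http://yellowpages.com.mt/Web_Service/SearchMap.asmx?WSDL"
soap = SOAP::WSDLDriverFactory.new(WSDL_URL).create_rpc_driver
# mainDetailID is the id parsed out of the LoadMapByDetail call
response = soap.GetCoordByDetail(:mainDetailID => '1668154', :type => '1')
soap.reset_stream
# Print each coordinate returned by the service
response.getCoordByDetailResult.anyType.each { |x| puts x.anyType }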

You see the latitude and longitude in the output:

35.88805
14.46627
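
The "parse the id out" step isn't shown above, so here is a minimal sketch of one way it might be done with a plain regular expression. extract_map_args is a hypothetical helper name, and the sketch assumes the call always has three integer arguments as in the sample line; only the first is known to be the detail id, so what the other two mean is a guess.

```ruby
# Hypothetical helper: finds the first LoadMapByDetail(...) call in the
# page source and returns its three arguments as integers, or nil when
# no such call is present.
def extract_map_args(html)
  match = html.match(/LoadMapByDetail\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)/)
  return nil unless match

  match.captures.map { |arg| arg.to_i }
end

# Sample line taken from the page, as quoted earlier in this answer.
html = "<script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>"
detail_id, _second_arg, _third_arg = extract_map_args(html)
puts detail_id
```

The resulting id can then be passed (as a string, if the service expects one) as mainDetailID in the GetCoordByDetail call.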

Hope this helps. Good luck!

Eric
You are seriously a genius, Eric! Thank you so much; I wouldn't have arrived at a solution without your help. Thanks once again.
Erika