Hi, I have read a good number of tutorials on Hpricot, but the problem I keep running into is that it does not seem to scrape all of the HTML. I'll elaborate:

The website I am trying to scrape is http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx .

I need to obtain the links that are listed as results. (I need to do this for potentially any URL on that site, so RSS or similar is no use: the program has to read them on the fly from whatever URL I feed it.)

I have tried everything to pull out the specific ID I need (giving the direct XPath and so on), but I realised that when I do

doc = Hpricot(open("http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx", 'User-Agent'=>'ruby'))
str = doc
puts str

the output excludes all the HTML related to the links I need! So whichever method I use to scrape, it does not find the required elements, because according to Hpricot they are not there.

When I view the source in Firefox, however, I do see them, so I'm very confused. Does anyone know how to get around this issue? I have been trying for ages and can't find a solution on my own. Any help would be highly appreciated.

+2  A: 

Hi Erika,

It looks like the site varies its response based on the User-Agent header. If I change that header to match what my version of Firefox sends, I get the full response body. When I left it as 'ruby', the response was incomplete. Not sure what the root cause is, but this seemed to alleviate the symptoms.

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = open("http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx", 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }
puts doc.search('h6')
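
To check that the User-Agent really is the variable, one quick experiment is to fetch the same page under both values and compare the body sizes. This is only a minimal sketch: it assumes Ruby 2.5 or later (where open-uri exposes URI.open), and body_sizes and DEFAULT_FETCHER are hypothetical names made up for illustration, not part of open-uri or Hpricot. The fetcher is passed in as a lambda so the comparison can be tried without touching the network.

```ruby
require 'open-uri'

# Default fetcher: downloads the URL with the given User-Agent header and
# returns the response body as a String.
# Assumption: Ruby 2.5+, where open-uri provides URI.open.
DEFAULT_FETCHER = lambda do |url, user_agent|
  URI.open(url, 'User-Agent' => user_agent) { |f| f.read }
end

# Hypothetical helper: returns a Hash mapping each User-Agent string to the
# size in bytes of the body the server sent back for it.
def body_sizes(url, user_agents, fetcher = DEFAULT_FETCHER)
  user_agents.each_with_object({}) do |ua, sizes|
    sizes[ua] = fetcher.call(url, ua).bytesize
  end
end

# Example usage (performs real HTTP requests, so it is left commented out):
#
#   url    = 'http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx'
#   agents = ['ruby',
#             'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2']
#   body_sizes(url, agents).each { |ua, size| puts "#{size} bytes for #{ua}" }
```

If the two sizes differ noticeably, that points at the server tailoring its output to the header rather than at Hpricot dropping elements while parsing.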

Hope this helps!

Eric
Worked like a charm! Thank you so much!! You are a life saver <3
Erika