I'm trying to parse a webpage using open-uri + hpricot, but there seems to be a problem in the parsing process, as the gems don't return the elements I want.

Specifically I want to get this div (whose id is 'pasajes') in this url:

http://www.despegar.com.ar

I wrote this code:

require 'nokogiri'
require 'hpricot'
require 'open-uri'

document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI

pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")

But it returns NOTHING! I've tried a lot of things in both hpricot and nokogiri:

  1. Giving the absolute path to that div
  2. Using a CSS path with selectors
  3. Using the hpricot search shortcut (doc//"div#pasajes")
  4. Almost every possible relative path to reach the 'pasajes' div

Finally I found a horrible solution: I used the watir library to open a web browser and then passed the resulting HTML to hpricot. That way hpricot DOES RECOGNIZE the 'pasajes' div. But I don't want to open a web browser just for parsing purposes...

What am I doing wrong? Is open-uri misbehaving? Is hpricot?

+3  A: 

There's no div with id 'pasajes' in that page. That's the problem.

JtR
How can I see that there's no div with id 'pasajes'? Viewing the source code, I can find a div with that id. I don't understand why you say that div doesn't exist... Thanks
juanmaflyer
If I view the source with Firefox, it doesn't have any div with that id. Which line is the div on?
JtR
OK, I managed to get it to show up. The site modifies its content based on some cookie/country/something. That's probably why it shows up in your browser but not with python. You can try to track down what is responsible for the modified content and then configure your script to reproduce it. It's probably a cookie; here are instructions on how to use cookies with python and HTTP: http://www.jayconrod.com/cgi/view_post.py?17
JtR
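Since the question uses Ruby rather than python, the cookie idea from the comment above could be sketched with open-uri as follows. This is only a minimal sketch: `cookie_header` and `fetch_with_cookies` are hypothetical helper names, and the cookie name and value in the example are made up, because the thread never identifies which cookie (if any) the site actually checks.

```ruby
require 'open-uri'

# Build the value of a Cookie request header from a hash of names and values.
def cookie_header(cookies)
  cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
end

# Fetch a page while sending the given cookies.
# open-uri treats string keys in the options hash as request header fields.
def fetch_with_cookies(url, cookies)
  URI.parse(url).read('Cookie' => cookie_header(cookies))
end

# Example (needs network access; the cookie name and value are assumptions):
#   html = fetch_with_cookies('http://www.despegar.com.ar/', 'country' => 'AR')
```

The returned string can then be handed to Hpricot or Nokogiri exactly as in the question. To find the real cookie, compare the request headers your browser sends (Firebug shows them) with what the script sends.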
+3  A: 

There's no DIV with the id pasajes in the static HTML page. If you are running *nix you can see that by doing:

curl http://www.despegar.com.ar/ | grep pasajes

My guess is that it's JavaScript-generated.

If you are using MacRuby you could try Lyndon.
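If curl and grep are not available (for example on Windows), roughly the same check can be done from Ruby itself. The sketch below uses only open-uri; `grep_html` and `grep_url` are hypothetical helper names, not part of any library. It downloads the static HTML, without running any JavaScript, and prints the lines that mention the text you are looking for.

```ruby
require 'open-uri'

# Return every line of the HTML string that contains the needle, like grep does.
def grep_html(html, needle)
  html.each_line.select { |line| line.include?(needle) }
end

# Download the static HTML (no JavaScript is executed) and grep it.
# URI::HTTP#read is provided by open-uri.
def grep_url(url, needle)
  grep_html(URI.parse(url).read, needle)
end

# Example (needs network access):
#   puts grep_url('http://www.despegar.com.ar/', 'pasajes')
```

If this prints nothing while the browser's DOM inspector shows the div, the element is most likely being added after the page loads.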

Jonas Elfström
Ah, that could be the problem. Is there any way to parse that, or will watir be my only alternative? Thanks
juanmaflyer
I am not running *nix; I'm on Windows XP. With Firebug, the IE Developer Toolbar, or Firefox's "view source", I can see the 'pasajes' div. Why do you say there's no 'pasajes' div?
juanmaflyer
There's no <div id="pasajes"> in the static page; it's put there with JavaScript. I'm starting to get cold feet over this. Could you please explain why you want to monitor Despegar's offers?
Jonas Elfström
+1 for the link to lyndon
ADAM
+1  A: 

This fits more as an additional comment on Jonas' answer above rather than an answer in itself... But I am new to SO and do not have the "commenting powers" yet :)

You can use Selenium RC to download the full HTML and then use nokogiri on the downloaded file. Note that this will work only if the content is being generated/modified by JavaScript. If the webpage depends on cookies to set up the content, your options would be Selenium (in the browser) or watir, as you have noted.

I would love to hear a better solution to this (I want to parse a webpage with nokogiri, but the page is modified by JS).

arnab
A: 

I ran into a similar issue with Nokogiri, but on OS X 10.5. I first tried open-uri to open the pages in question, which have lots of HTML (div, p, whatever). I found that by using:

urldoc = open('http://hivelogic.com/articles/using_usr_local')
urldoc.each_line { |line| puts line } # readlines ignores a block; each_line yields every line

I would see lots of wonderful HTML. I also found that by reading the "file" into a string and passing that to Nokogiri, I could get it to work fine. I even had to modify the very demo they use on rubyforge to teach you about Nokogiri.

Using their own example I get this:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>

>> doc.children
=>

YUCK!

If I tweak to read in the url to a string, I get good stuff:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>

Note I do see this lovely warning when I use irb to play:

HI. You're using libxml2 version 2.6.16 which is over 4 years old and has plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you upgrade your version of libxml2 and re-install nokogiri. If you like using libxml2 version 2.6.16, but don't like this warning, please define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

But I am not in the mood to deal with the horrors and the various expert but contradictory advice on fixing libxml in /usr/local, blah blah. A post on link text has a great explanation of it, but then another *nix wizard attacks the very concept with some sound warnings and concerns. So I say, "no way".

Why do I write this? Because I think there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 is on an old libxml2, and Nokogiri may have issues with that.

QUESTION

Do any other OS X 10.5 users have this issue with Nokogiri?

John Ferguson
Separate question, isn't it?
ADAM