ansaurus

Question

[Ruby] open-uri + hpricot & nokogiri don't parse html correctly

Answer 1

+3 A:

There's no div with id 'pasajes' in that page. That's the problem.

JtR 2009-08-31 14:38:52

How can I see that there's no div with id 'pasajes'? When I view the source code I can find a div with that id. I don't understand why you say that div doesn't exist... Thanks

juanmaflyer 2009-08-31 17:48:11

If I view the source with Firefox, it doesn't have any divs with that id. Which line is the div on?

JtR 2009-08-31 18:13:53

OK, I managed to get it showing. The site modifies its content based on some cookie/country/something. That's probably why it shows up in your browser but not in your script. You can try to track down whatever is responsible for the modified content and then configure your script to repeat it. It's probably some cookie, and here are instructions on how to use cookies with Python and HTTP: http://www.jayconrod.com/cgi/view_post.py?17
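Since the question is about Ruby, here is a minimal sketch of the same idea using only the standard library. The cookie name and value (`pais`, `AR`) are made up for illustration; the real cookie, if there is one, would have to be found by inspecting what the browser sends. The helper names are hypothetical too.

```ruby
require 'net/http'
require 'uri'

# Builds a GET request for +url+ that carries +cookies+ (a Hash)
# in a single Cookie header. No network access happens here.
def build_request(url, cookies)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['Cookie'] = cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  request
end

# Sends the request and returns the response body (performs network I/O).
def fetch_with_cookies(url, cookies)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(build_request(url, cookies)).body
  end
end

# 'pais' => 'AR' is an assumed cookie, not one the site is known to use.
request = build_request('http://www.despegar.com.ar/', 'pais' => 'AR')
puts request['Cookie']
```

Calling `fetch_with_cookies` with the same arguments would return the HTML the server sends for that cookie, which can then be handed to the parser.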

JtR 2009-08-31 18:18:09

Answer 2

+3 A:

There's no DIV with the id pasajes in the static HTML page. If you are running *nix you can see that by doing:

curl http://www.despegar.com.ar/ | grep pasajes

My guess is that it's JavaScript-generated.
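For anyone without curl and grep (on Windows, for example), roughly the same check can be sketched in Ruby with the standard library. The helper names below are hypothetical, and the sample string is a stand-in for the real page rather than a copy of it.

```ruby
require 'open-uri'

# Returns the lines of +html+ that mention +needle+, much like `grep needle`.
def grep_html(html, needle)
  html.each_line.select { |line| line.include?(needle) }
end

# Downloads the page exactly as the server sends it, before any JavaScript runs.
# URI.open needs Ruby 2.5 or newer; on older Rubies use plain open(url).
def fetch_static_html(url)
  URI.open(url) { |io| io.read }
end

# Stand-in markup, used so the check can be tried without network access.
sample = "<html><body>\n<div id=\"ofertas\"></div>\n</body></html>\n"
matches = grep_html(sample, 'pasajes')
puts matches.empty? ? 'not in the static HTML' : matches
```

Running `grep_html(fetch_static_html('http://www.despegar.com.ar/'), 'pasajes')` would perform the check against the live page. An empty result means the div is not in the markup the server sends.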

If you are using MacRuby you could try Lyndon.

Jonas Elfström 2009-08-31 14:40:29

Ah, that could be the problem. Is there any way to parse that, or will Watir be my only alternative? Thanks

juanmaflyer 2009-08-31 15:49:43

I am not running *nix; I'm on Windows XP. With Firebug, the IE Developer Toolbar, or Firefox's "view source" I can see the 'pasajes' div. Why do you say there's no 'pasajes' div?

juanmaflyer 2009-08-31 17:47:04

There's no <div id="pasajes"> in the static page; it's put there with JavaScript. I'm starting to get cold feet over this. Could you please explain why you want to monitor Despegar's offers?

Jonas Elfström 2009-08-31 19:35:56

+1 for the link to lyndon

ADAM 2009-10-26 19:31:47

Answer 3

+1 A:

This fits more as an additional comment on Jonas' answer above rather than an answer in itself... But I am new to SO and do not have the "commenting powers" yet :)

You can use Selenium RC to download the full HTML and then use Nokogiri on the downloaded file. Note that this will work only if the content is being generated/modified by JavaScript. If the webpage depends on cookies to set up the content, your options would be Selenium (in the browser) or Watir, as you have noted.

I would love to hear a better solution to this (want to parse webpage with nokogiri, but the page is modified by JS).

arnab 2009-09-03 07:04:48

Answer 4

A:

I ran into a similar issue with Nokogiri, but on OS X 10.5. I first tried open-uri on its own to open the pages in question, which have lots of HTML (div, p, whatever). I found that by using:

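require 'open-uri'
# On Ruby 2.5+ use URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
urldoc = open('http://hivelogic.com/articles/using_usr_local')
urldoc.each_line { |line| puts line }  # readlines ignores a block; each_line yields every line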

I would see lots of wonderful HTML. I also found that reading the "file" into a string and passing that string to Nokogiri worked fine. I even had to modify the very demo they use on RubyForge to teach you about Nokogiri.

Using their own example I get this:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>

>> doc.children
=>

YUCK!

If I tweak to read in the url to a string, I get good stuff:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>
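The workaround above amounts to always handing the parser a String instead of an IO-like object. A small sketch of that normalization step follows. The helper name `html_string` is made up for illustration, and a StringIO stands in for what open-uri returns so the example runs without network access or the nokogiri gem.

```ruby
require 'stringio'

# Returns the HTML as a String, whether +source+ is an IO-like object
# (anything that responds to #read) or already a String.
def html_string(source)
  source.respond_to?(:read) ? source.read : source.to_s
end

# StringIO is only a stand-in for the object returned by open(url).
io = StringIO.new('<html><body><p>hi</p></body></html>')
html = html_string(io)
puts html.class

# With the nokogiri gem installed, the string can then be parsed:
#   doc = Nokogiri::HTML(html)
```

Whether this sidesteps the empty-document result on a given setup is untested here; it only mirrors the `.read` tweak shown in the irb session.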

Note I do see this lovely warning when I use irb to play:

HI. You're using libxml2 version 2.6.16 which is over 4 years old and has plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you upgrade your version of libxml2 and re-install nokogiri. If you like using libxml2 version 2.6.16, but don't like this warning, please define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

But I am not in the mood to deal with the horrors and the various expert but contradictory advice on fixing libxml in /usr/local, blah blah. A post on link text has a great explanation of it, but then another *nix wizard attacks the very concept with some sound warnings and concerns. So I say, "no way".

Why do I write this? Because I think there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 ships an old libxml2, and Nokogiri may have issues with that.

QUESTION

Do any other OS X 10.5 users have this issue with Nokogiri?

John Ferguson 2009-10-10 18:41:38

Separate question, isn't it?

ADAM 2009-10-26 19:34:56
