I decided to give Nokogiri a try, and copied the following program straight from http://nokogiri.rubyforge.org/nokogiri/Nokogiri.html (adding only the require 'rubygems' and the I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 constant):

require 'rubygems'
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 = 1
require 'nokogiri'
require 'open-uri'

# Get a Nokogiri::HTML::Document for the page we’re interested in...

doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))

# Do funky things with it using Nokogiri::XML::Node methods...

####
# Search for nodes by css
doc.css('h3.r a.l').each do |link|
  puts link.content
end

It returned no results. But when I changed

    doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))

to

    doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)

the program worked as expected. Notice that the only difference was the addition of the .read at the end of the line. I would never have figured this out by myself, because just about every bit of example code leaves off the .read. The one place that included it, ironically, was a post by one of the Nokogiri developers (at http://tenderlovemaking.com/2008/11/18/underpant-free-excitement). Did something in the API change? What am I missing?

I'm using Nokogiri 1.3.2.

Thank you.

A: 

I copied and pasted your (original) code into a Ruby file and ran it on my system (ruby 1.8.6p369, Nokogiri 1.3.2) and it worked fine. Might there be something else in your environment that could be causing the problem? Nokogiri aside, what does open('http://www.google.com/search?q=tenderlove') return for you?

Greg Campbell
Just for confirmation: The first snippet works here as well.
Chuck
open('http://www.google.com/search?q=tenderlove') returns #<File:/var/folders/lX/lXtmfMFWFDmYij9VLHquDk+++TI/-Tmp-/open-uri.5470.0>
gauth
A: 

Not sure what your issue is, but the call to open comes from open-uri, not Nokogiri. So try some experiments with Nokogiri taken out of play.

$ irb
>> require 'open-uri'
=> true
>> f = open('http://www.google.com/search?q=tenderlove')
=> #<File:/var/folders/LA/LACsuKOVHtaEgmBzsJcGAE+++TI/-Tmp-/open-uri.7455.0>
>> f.read
=> "<!doctype html><head><title>tenderlove - Google Search</title>...
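If the parser only sees part of the page on your setup, one workaround is to turn whatever open returns into a String before handing it on, which is what adding .read does. Here is a minimal sketch of that idea. The helper name read_fully is made up for illustration and is not part of open-uri or Nokogiri, and a StringIO stands in for the object open returns so the snippet runs without a network connection.

```ruby
require 'stringio'

# Hypothetical helper (not part of open-uri or Nokogiri): return the full
# contents of `source` as a String, whether it is already a String or an
# IO-like object such as the File that open-uri hands back.
def read_fully(source)
  return source unless source.respond_to?(:read)
  source.rewind if source.respond_to?(:rewind) # start again from the beginning
  source.read
end

# Assumption: StringIO is a stand-in for the result of open('http://...').
page = StringIO.new('<!doctype html><head><title>tenderlove</title></head>')
page.read(15) # simulate a consumer that stopped right after the doctype

html = read_fully(page)
puts html.length
```

The resulting String can then be passed to Nokogiri::HTML(...) in the same way as the .read version in the question.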
Aaron Hinni
My irb session behaves just like yours. If I use Nokogiri without the .read, it returns <!DOCTYPE html> and nothing else. If I add the .read, I get the full file. So I don't know why passing the result of open to Nokogiri triggers the read for everyone's installation but mine.
gauth
A: 

I upgraded to Nokogiri 1.3.3, and upgraded libxml2 to 2.7.3. I no longer need to use the ridiculous I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 = 1 statement to avoid error messages, and the program works without the extraneous .read.

gauth
A: 

It's always good to check your version of Nokogiri and libxml to make sure they're current.

As of today (9/22/09) this is current on MacOS:

nokogiri -v
--- 
nokogiri: 1.3.3
warnings: [ ]

libxml: 
  compiled: 2.7.4
  loaded: 2.7.4
  binding: extension

(I put a space inside the empty warnings array to keep it from looking like a box.)
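If you want a script to complain when libxml2 is too old, comparing the version strings is enough. The sketch below uses Gem::Version from RubyGems for the comparison. The helper name older_than? is made up, 2.7.3 is the version reported as working earlier in this thread, and 2.6.16 is just a hypothetical older value. In practice you would feed it the "loaded" value that nokogiri -v prints.

```ruby
require 'rubygems' # provides Gem::Version (loaded by default on newer Rubies)

# Hypothetical helper: true when the `loaded` version string is older
# than the `minimum` version string.
def older_than?(loaded, minimum)
  Gem::Version.new(loaded) < Gem::Version.new(minimum)
end

# 2.7.4 is the loaded libxml version shown in the output above.
puts older_than?('2.7.4', '2.7.3')

# Assumption: 2.6.16 is a made-up example of an older install.
puts older_than?('2.6.16', '2.7.3')
```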

Greg