I am by no means a master of Ruby and am quite new to Scrubyt. I was trying out some examples from their wiki page. The example I was working on fetches the search results returned by Google when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could fetch that page as well. The problem is that I don't know how to grab the URL properly. This is my code:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q','ruby'
  submit

  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end

google_data.to_xml.write($stdout, 1)

The code prints the XML data correctly (name and link), but how do I retrieve the link without the <link_url> tags that seem to get added to it? (I tried printing link_url and noticed the tags are printed as well.) Could I do something as simple as fetch link_url, or is there a way of extracting the text from the XML content held in link_url?

This is some of the content that gets printed by the google_data.to_xml.write():

<root>
  <link_title>
    Ruby Programming Language
    <link_url>http://ruby-lang.org/</link_url>
  </link_title>
  <link_title>
    Download Ruby
    <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
  </link_title>
  <link_title>
    Ruby - The Inspirational Weight Loss Journey on the Style Network ...
    <link_url>http://www.mystyle.com/mystyle/shows/ruby/index.jsp</link_url>
  </link_title>
  <link_title>
    Ruby (programming language) - Wikipedia, the free encyclopedia
    <link_url>http://en.wikipedia.org/wiki/Ruby_(programming_language)</link_url>
  </link_title>
</root>
A: 

I'd think about alternatives. Scrubyt hasn't been updated in a while, and the forums have been shut down.

Mechanize can do what the Extractor does, Nokogiri can parse XML or HTML responses, and Builder can create XML (though it seems like you don't really want XML).
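If you do stay with the XML output for now, here is a minimal sketch of pulling the URLs out as plain strings. It assumes the XML has the shape shown in the question, and it uses REXML from Ruby's standard library rather than Nokogiri so that it needs no extra gem. The `extract_urls` helper and the `SAMPLE_XML` constant are hypothetical names made up for this illustration, not part of Scrubyt.

```ruby
require 'rexml/document'

# Sample shaped like the output in the question (an assumption, not live data).
SAMPLE_XML = <<~XML
  <root>
    <link_title>
      Ruby Programming Language
      <link_url>http://ruby-lang.org/</link_url>
    </link_title>
    <link_title>
      Download Ruby
      <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
    </link_title>
  </root>
XML

# Hypothetical helper: returns the text inside every <link_url> element
# as a plain String, without the surrounding tags.
def extract_urls(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//link_url').map { |node| node.text.strip }
end

urls = extract_urls(SAMPLE_XML)
urls.each { |url| puts url }
```

Each entry in `urls` is an ordinary string, so it can be handed to whatever you use to fetch the page next.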

Mark Thomas