tags:

views:

256

answers:

5

This might be a similar problem to my earlier two questions - see here and here but I'm trying to use the _detail command to automatically click the link so I can scrape the details page for each individual event.

The code I'm using is:

require 'rubygems'
require 'scrubyt'

nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail do
      dates "1-4 October"
      times "7:30pm"
    end
  end

  next_page "Next Page", :limit => 20
end

  nuffield_data.to_xml.write($stdout,1)

Is there any way to print out the URL that using the event_detail is trying to access? The error doesn't seem to give me the URL that gave the 404.

Update: I think the link may be a relative link - could this be causing problems? Any ideas how to deal with that?

+1  A: 
    sudo gem install ruby-debug

This will give you access to a nice ruby debugger, start the debugger by altering your script:

    require 'rubygems'
    require 'ruby-debug'
    Debugger.start
    Debugger.settings[:autoeval] = true if Debugger.respond_to?(:settings)

    require 'scrubyt'

    nuffield_data = Scrubyt::Extractor.define do
      fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

      event do
        title 'The Coast of Mayo'
        link_url
        event_detail do
          dates "1-4 October"
          times "7:30pm"
        end
      end

      next_page "Next Page", :limit => 2

    end

    nuffield_data.to_xml.write($stdout,1)

Then find out where scrubyt is throwing an exception - in this case:

    /Library/Ruby/Gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:52:in `fetch'

Find the scrubyt gem on your system, and add a rescue clause to the method in question so that the end of the method looks like this:

      if @@current_doc_protocol == 'file'
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
      else
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
        store_host_name(self.get_current_doc_url)   # in case we're on a new host
      end
    rescue
      debugger
      self # the self is here because debugger doesn't like being at the end of a method
    end

Now run the script again and you should be dropped into a debugger when the exception is raised. Just try typing this a the debug prompt to see what the offending URL is:

@@current_doc_url

You can also add a debugger statement anywhere in that method if you want to check what is going on - for example you may want to add one between line 51 and 52 of this method to check how the url that is being called changes and why.

This is basically how I figured out the answer to your previous questions.

Good luck.

A: 

Thank you for the very useful answer. I have tried this and @@current_doc_url seems to be nil when I look at it. Any ideas why this might be the case? I've tried to use the debugger to look at the value that has been passed into the function - but that seems to be nil too!

Any ideas where else I could look, or why it might be coming up as nill?

robintw
A: 

Sorry I have no idea why this would be nil - every time I have run this it returns a url - the method self.fetch requires a URL which you should be able to access as the local variable doc_url. If this returns nil also may you should post the code where you have included the debugger call.

A: 

I've tried to access doc_url but that seems to also return nil. When I have access to my server (later in the day) I'll post the code with the debugging bit in it.

robintw
+1  A: 

I had the same issue with relative links and fixed it like this... you have to set the :resolve param to the correct base url

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail :resolve => 'http://www.nuffieldtheatre.co.uk/cn/events' do
      dates "1-4 October"
      times "7:30pm"
    end
  end
Rohan