I maintain a website showing my university group's publications.

I have written a quick-and-dirty Ruby script to parse a CSV file containing this data (which I grab manually from the ISI Web of Science website) and present it in a nice HTML format.

There is no direct link to a PDF file in the CSV file. Instead, there is information I can use to go to doi.org, which forwards on to the real page (hosted by the journal) with a link to the PDF.

I want to be able to, for each publication in the CSV file, go to that web page, and grab the PDF.

I've never done this before. Fetching the page with wget in a terminal works fine, except that the PDF link on the journal's page is a relative URL, simply "/link info", missing the scheme and host.

Can anyone recommend a simple way of going about this, please?

+1  A: 

I'm not entirely clear what you're trying to do, but you can probably accomplish it with Mechanize or Watir. Mechanize parses and interacts with websites directly, but it doesn't support much in the way of JavaScript. Watir lets you drive an actual browser. Which is best for you depends on what you're actually doing.
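For illustration, here's a minimal Mechanize sketch (the DOI is a hypothetical placeholder, and the text used to spot the PDF link will vary by journal):

require 'mechanize'  # gem install mechanize

agent = Mechanize.new
# Mechanize follows the doi.org redirect automatically, so this request
# lands on the publisher's page.
page = agent.get('http://dx.doi.org/10.1000/182')  # hypothetical DOI
# Clicking a link resolves relative hrefs against the current page,
# which sidesteps the missing-host problem from the question.
pdf_link = page.links.find { |l| l.text =~ /pdf/i }
pdf_link.click.save('paper.pdf') if pdf_link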

Pesto
Using an actual GUI browser is really overkill for just downloading a file. You should also keep in mind that not every script runs on a system with a GUI.
johannes
@johannes: You might want to consider reading answers before commenting. I also recommended Mechanize, which *doesn't* need a graphical environment. But, as my answer mentions, it doesn't handle a good deal of JavaScript. If that were an issue, I provided an alternative solution.
Pesto
+1  A: 

I don't know about the Ruby side, but doi.org returns a redirect as an HTTP 302, along with a "Location:" header that contains the publisher's page URL. Then you'll have to scrape that page to find the PDF.
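A minimal sketch of that first step (the DOI is a hypothetical placeholder); URI.join is also how you'd resolve the relative links mentioned in the question:

require 'net/http'
require 'uri'

response = Net::HTTP.get_response(URI.parse('http://dx.doi.org/10.1000/182'))
if response.is_a?(Net::HTTPRedirection)
  publisher_url = response['location']
  # A relative href scraped from the publisher's page (hypothetical path here)
  # can be resolved against the page's own URL:
  puts URI.join(publisher_url, '/some/relative/paper.pdf')
end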

Jim Downing
A: 

Since you're already writing Ruby, this seems like a great fit for ScRUBYt.

hgimenez
+1  A: 

A simple solution would be to call wget from inside Ruby.

system("wget -O \"#{target}\" \"#{source\"")
  • system returns true or false depending on whether wget exited with status 0 or something else
  • be sure to properly escape target and source, or somebody might take over your system (see the shell-free variant below)
  • if you don't want wget's output in your terminal, append "> /dev/null 2> /dev/null" to the command string
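As hinted above, the multi-argument form of system bypasses the shell entirely, so no escaping is needed; wget's -q flag also silences its output:

# Arguments are passed straight to wget without shell interpretation,
# so target and source need no escaping.
system('wget', '-q', '-O', target, source)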

A cleaner solution would be to use Net::HTTP. The following example is taken from the Net::HTTP docs. Have a look at http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html for more info.

require 'net/http'
require 'uri'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess     then response
  # Redirects (like the 302 from doi.org) carry the target URL in the
  # Location header, so follow it with one fewer redirect allowed.
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    response.error!
  end
end
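For the asker's case, a hedged usage sketch (hypothetical DOI; as noted in the other answers, the resulting page usually still has to be scraped for the actual PDF link):

response = fetch('http://dx.doi.org/10.1000/182')  # hypothetical DOI
# response.body holds the publisher's landing page HTML, which you
# would then search for the PDF link.
File.open('landing_page.html', 'wb') { |f| f.write(response.body) }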
johannes