views: 285

answers: 3

I am consuming various XML-over-HTTP web services that return large XML files (> 2 MB). What would be the fastest Ruby HTTP library to reduce the download time?

Required features:

  • both GET and POST requests

  • gzip/deflate downloads (Accept-Encoding: deflate, gzip) - very important

I am deciding between:

  • open-uri

  • Net::HTTP

  • curb

but other suggestions are welcome as well.

P.S. To parse the response, I am using a pull parser from Nokogiri, so I don't need an integrated solution like rest-client or hpricot.

+2  A: 

http://github.com/pauldix/typhoeus

might be worth checking out. It's designed for large, fast, parallel downloads, and it's based on libcurl, so it's pretty solid.
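
As a rough illustration, here is a minimal sketch of fetching several documents in parallel with compressed responses. The URLs are placeholders, and the option name (accept_encoding) assumes a reasonably recent Typhoeus release, so it may differ in older versions:

require 'typhoeus'

# placeholder endpoints -- replace with your real service URLs
urls = %w[http://example.com/a.xml http://example.com/b.xml]

hydra = Typhoeus::Hydra.new
requests = urls.map do |url|
  # accept_encoding maps down to libcurl, which sends Accept-Encoding
  # and transparently decompresses gzip/deflate responses
  request = Typhoeus::Request.new(url, accept_encoding: 'gzip')
  hydra.queue(request)
  request
end

hydra.run  # runs all queued requests in parallel

bodies = requests.map { |request| request.response.body }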

That said, test Net::HTTP and see if the performance is acceptable before doing something more complicated.
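
For that baseline, a plain Net::HTTP GET with compression could look like the sketch below (placeholder URL; recent Ruby versions add Accept-Encoding and decompress automatically, while on older ones you handle it yourself as shown):

require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

uri = URI.parse('http://example.com/feed.xml')  # placeholder URL

response = Net::HTTP.start(uri.host, uri.port) do |http|
  request = Net::HTTP::Get.new(uri.request_uri)
  request['Accept-Encoding'] = 'gzip, deflate'
  http.request(request)
end

# decompress by hand when the server honoured Accept-Encoding
body = case response['Content-Encoding']
       when 'gzip'    then Zlib::GzipReader.new(StringIO.new(response.body)).read
       when 'deflate' then Zlib::Inflate.inflate(response.body)
       else response.body
       end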

Ben Hughes
+1  A: 

The fastest download is probably a #read on the IO object, which slurps the whole thing into a single String. After that you can apply your processing. Or do you need the file to be processed during the download?
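
For example, a minimal sketch of that slurp-then-parse approach using open-uri (which wraps Net::HTTP) could look like the following; the URL is a placeholder, and on older Rubies you would call Kernel#open instead of URI.open:

require 'open-uri'
require 'nokogiri'

# download the whole document first, then hand the single String
# to the pull parser mentioned in the question
xml    = URI.open('http://example.com/feed.xml').read
reader = Nokogiri::XML::Reader(xml)
reader.each { |node| puts node.name }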

Robert Klemme
I don't want it to be that complicated. With your solution, I would have to write my own HTTP wrapper, with gzip/deflate support and POST support.
Vlad Zloteanu
I don't understand what you think is so complicated about my suggestion. Using IO#read with Net::HTTP to read the whole thing into one string is probably as easy as downloading gets.
Robert Klemme
+3  A: 

You can use EventMachine and em-http to stream the XML:

require 'rubygems'
require 'eventmachine'
require 'em-http'
require 'nokogiri'

# this is your SAX handler; I'm not very familiar with
# Nokogiri, so I just took an example from the RDoc
class StreamingDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs=[])
    puts "starting: #{name}"
  end

  def end_element(name)
    puts "ending: #{name}"
  end
end

document = StreamingDocument.new
url = 'http://stackoverflow.com/feeds/question/2833829'

# run the EventMachine reactor, this call will block until 
# EventMachine.stop is called
EventMachine.run do
  # Nokogiri wants an IO to read from, so create a pipe that it
  # can read from, and we can write to
  io_read, io_write = IO.pipe

  # run the parser in its own thread so that it can block while
  # reading from the pipe
  EventMachine.defer(proc {
    parser = Nokogiri::XML::SAX::Parser.new(document)
    parser.parse_io(io_read)
  }, proc {
    # parse_io returns once the pipe hits EOF, so stop the reactor here
    EventMachine.stop
  })

  # use em-http to stream the XML document, feeding the pipe with
  # each chunk as it becomes available
  http = EventMachine::HttpRequest.new(url).get
  http.stream { |chunk| io_write << chunk }

  # when the HTTP request is done, close the pipe's write end so the
  # parser sees EOF; EventMachine is stopped from the defer callback above
  http.callback { io_write.close }
end

It's a bit low-level, perhaps, but probably the most performant option for any document size. Feed it hundreds of megabytes and it will not fill up your memory, as any non-streaming solution would (as long as you don't keep too much of the document you're loading around, but that's on your side of things).

Theo