It seems like the methods of Ruby's Net::HTTP are all or nothing when it comes to reading the body of a web page. How can I read, say, just the first 100 bytes of the body?

I am trying to read from a content server that returns a short error message in the body of the response if the file requested isn't available. I need to read enough of the body to determine whether the file is there. The files are huge, so I don't want to get the whole body just to check if the file is available.

A: 

You can't. But why do you need to? Surely if the page just says that the file isn't available then it won't be a huge page (i.e. by definition, the file won't be there)?

+2  A: 

Are you sure the content server only returns a short error page?

Doesn't it also set the response status to something appropriate, like 404? In that case you can rescue the exception that Net::HTTPResponse#value raises for any non-2xx response (for a 404 the response object is most likely a Net::HTTPNotFound, which derives from Net::HTTPClientError).

If you get an error, then your file wasn't there. If you get a 200, the file is starting to download and you can close the connection.
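A minimal sketch of that idea, assuming the server really does use status codes properly. The helper name resource_available? and the :status_known tag are made up for illustration. One thing to watch: when the request block returns normally, Net::HTTP appears to read the rest of the body anyway, so the sketch leaves the block with throw, and the connection is then closed when Net::HTTP.start unwinds.

```ruby
require "net/http"

# Hypothetical helper: true if a GET for `path` is answered with a 2xx status.
# The body is never read; we leave the request block as soon as the status
# line and headers have arrived.
def resource_available?(host, port, path)
  available = false
  catch(:status_known) do
    Net::HTTP.start(host, port) do |http|
      http.request_get(path) do |response|
        available = response.is_a?(Net::HTTPSuccess)
        # Returning normally from this block would let Net::HTTP read the
        # whole body, so jump out instead; Net::HTTP.start closes the socket.
        throw :status_known
      end
    end
  end
  available
end
```

Called as resource_available?("content.example.com", 80, "/huge_file"), with a host and path of your own, this returns true or false without downloading the file.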

Jean
+1  A: 

To read the body of an HTTP response in chunks, you'll need to use Net::HTTPResponse#read_body like this:

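# Assumes `http` is an already started Net::HTTP session.
http.request_get('/large_resource') do |response|
  # With a block, read_body yields the body in segments as they come off the socket.
  response.read_body do |segment|
    print segment
  end
end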
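To stop after the first 100 bytes or so, the same read_body loop can be abandoned part-way. What follows is only a sketch: first_bytes and the :enough tag are hypothetical names, and it relies on the observation that Net::HTTP seems to drain the remaining body if the request block returns normally (a plain break out of read_body is not enough), so it uses throw to leave every block at once and lets Net::HTTP.start close the connection.

```ruby
require "net/http"

# Hypothetical helper: return at most `limit` bytes from the start of the body.
def first_bytes(host, port, path, limit = 100)
  buffer = String.new # binary string, grows as segments arrive
  catch(:enough) do
    Net::HTTP.start(host, port) do |http|
      http.request_get(path) do |response|
        response.read_body do |segment|
          buffer << segment
          # Leave all the blocks at once; the connection is closed on the way out.
          throw :enough if buffer.bytesize >= limit
        end
      end
    end
  end
  # A segment is usually bigger than `limit`, so trim to the requested size.
  buffer.byteslice(0, limit)
end
```

The server may already have sent more than `limit` bytes by the time the socket is closed (at least one read buffer's worth), but the rest of a huge file is never transferred.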
Nathan de Vries
Tried this. request_get still wants to download the whole file before processing the block.
bvanderw
+6  A: 

Shouldn't you just use an HTTP HEAD request (Ruby's Net::HTTP::Head class) to see if the resource is there, and only proceed if you get a 2xx or 3xx response? This presumes your server is configured to return a 4xx error code if the document is not available. I would argue this is the correct solution.

An alternative is to send the HEAD request and look at the Content-Length header value in the result: if your server is correctly configured, you should easily be able to tell the difference in length between a short message and a long document. Another alternative: set the Range header field in the request (which again assumes that the server is behaving correctly with respect to the HTTP spec).
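Here is a rough sketch of the HEAD approach, assuming the server answers HEAD sensibly. The helper name head_check and the keys of the hash it returns are made up for illustration; Net::HTTP#head and Net::HTTPResponse#content_length are the standard library calls.

```ruby
require "net/http"

# Hypothetical helper: send a HEAD request and report what the server says
# about the resource, without transferring any body.
def head_check(host, port, path)
  Net::HTTP.start(host, port) do |http|
    response = http.head(path)
    {
      # 2xx or 3xx counts as "the resource is there"
      reachable: response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection),
      # Integer, or nil if the server sent no Content-Length header
      content_length: response.content_length
    }
  end
end
```

A result such as reachable: true with a small or missing content_length would be the hint that the body is a short message and not the real file.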

I don't think that solving the problem in the client after you've sent the GET request is the way to go: by that time, the network has done the heavy lifting, and you won't really save any wasted resources.

Reference: http header definitions

Ian

Ian Dickinson
Tried that. The server sends an OK response and a 0 for content-length. This is the P4Web server from Perforce.
bvanderw
Hmm. If your vendor sends 200 OK when what it really means is 404 Not Found, then you should raise a priority bug report with them!
Ian Dickinson
+1  A: 

I have tried just getting the header and that doesn't work.

The server (this is the Perforce web server p4Web, by the way) sends an OK response and it doesn't return any of the other values like content-length. Tried that already. The error message is in the body. I want to read just enough of the body to determine whether this error message exists.

bvanderw
+1  A: 

I wanted to do this once, and the only thing I could think of was monkey patching the Net::HTTPResponse#read_body and Net::HTTPResponse#read_body_0 methods to accept a length parameter. The former would just pass the length on to read_body_0, where you can read at most that many bytes.

Roman
A: 

I have a similar problem. I tried to use the Range header:

require "net/https"
require "uri"

uri = URI.parse("http://www.example.com/index.html")
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new(uri.request_uri)
request.range = (0..99)
# I tried this, too:
# request.initialize_http_header({
#   "Range" => "bytes=0-99"
# })

response = http.request(request)
puts response.body

But it still does not work: I get the full body in the response, and I don't know why. A HEAD request is out of the question for me, since I need to read the first bytes of the files (without downloading the whole thing).
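One thing worth checking is the status code: a server is allowed to ignore the Range header and answer 200 with the full body, and only a 206 Partial Content response means the range was honoured. The sketch below bails out before reading anything if the answer is not a 206. fetch_range and the :range_ignored tag are hypothetical names, and whether a given server supports ranges at all is an assumption you would have to verify.

```ruby
require "net/http"

# Hypothetical helper: request bytes 0..last_byte of `path`.
# Returns the partial body, or nil if the server did not answer with 206.
def fetch_range(host, port, path, last_byte = 99)
  request = Net::HTTP::Get.new(path)
  request.range = (0..last_byte) # sends "Range: bytes=0-<last_byte>"
  result = nil
  catch(:range_ignored) do
    Net::HTTP.start(host, port) do |http|
      http.request(request) do |response|
        # Anything other than 206 means the full body is on its way: give up
        # before reading it, and let Net::HTTP.start close the connection.
        throw :range_ignored unless response.is_a?(Net::HTTPPartialContent)
        result = response.read_body
      end
    end
  end
  result
end
```

If this returns nil, the server is ignoring the Range header, and cutting the download short on the client side is the remaining option.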

Hashmalech