tags:

views:

280

answers:

4

Is there any way of limiting the amount of data CURL will fetch? I'm screen scraping data off a page that is 50kb, however the data I require is in the top 1/4 of the page so I really only need to retrieve the first 10kb of the page.

I'm asking because there is a lot of data I need to monitor which results in me transferring close to 60GB of data per month, when only about 5GB of this bandwidth is relevant.

I am using PHP to process the data, however I am flexible in my data retrieval approach, I can use CURL, WGET, fopen etc.

One approach I'm considering is

$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);

Does the above mean I will only transfer 6kb from www.website.com, or will fopen load www.website.com into memory meaning I will still transfer the full 50kb?

A: 

It will download the whole page with the fopen call, but then it will only read 6kb from that page.

From the PHP manual:

Reading stops as soon as one of the following conditions is met:

  • length bytes have been read
James Skidmore
+3  A: 

This is more an HTTP that a CURL question in fact.

As you guessed, the whole page is going to be downloaded if you use fopen. No matter then if you seek at offset 5000 or not.

The best way to achieve what you want would be to use a partial HTTP GET request, as stated in HTML RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html):

The semantics of the GET method change to a "partial GET" if the request message includes a Range header field. A partial GET requests that only part of the entity be transferred, as described in section 14.35. The partial GET method is intended to reduce unnecessary network usage by allowing partially-retrieved entities to be completed without transferring data already held by the client.

The details of partial GET requests using Ranges is described here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2

Aurélien Vallée
+1  A: 

try a HTTP RANGE request:

GET /largefile.html HTTP/1.1
Range: bytes=0-6000

if the server supports range requests, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes (if it doesn't, it will return 200 and the whole file). see http://benramsey.com/archives/206-partial-content-and-range-requests/ for a nice explanation of range requests.

see also Resumable downloads when using PHP to send the file?.

ax
+1  A: 

You may be able to also accomplish what you're looking for using CURL as well.

If you look at the documentation for CURLOPT_WRITEFUNCTION you can register a callback that is called whenever data is available for reading from CURL. You could then count the bytes received, and when you've received over 6,000 bytes you can return 0 to abort the rest of the transfer.

The libcurl documentation describes the callback a bit more:

This function gets called by libcurl as soon as there is data received that needs to be saved. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.

The callback function will be passed as much data as possible in all invokes, but you cannot possibly make any assumptions. It may be one byte, it may be thousands.

Keith Palmer
I've marked this as the accepted answer as it is more robust than the HTTP Range request which may not always be supported, and I can only mark one answer.
James