I'm writing a specialized PHP proxy and got stumped by a feature of cURL.
If the following values are set:
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_HEADER, true );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
cURL correctly handles redirects, but returns the headers of ALL responses in the redirect chain, not just those of the final (non-redirect) page, e.g.
HTTP/1.1 302 Found
Location: http://otherpage
Set-Cookie: someCookie=foo
Content-Length: 198
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 3241
<!DOCTYPE HTML>
...rest of content
Note that CURLOPT_HEADER is set because I need to read and copy parts of the original header into my proxy header.
I appreciate why it's returning all these headers (for example, my proxy code must detect any cookies set in the 302 header and pass them along). HOWEVER, it also makes it impossible to detect where the headers end and the content begins. Normally, with a single header block, we could just do a simple split:
$split = preg_split('/\r\n\r\n/', $fullPage, 2);
But that obviously won't work here. Hm. We could try something that only splits if it looks like the next line is part of a header:
$split = preg_split('/\r\n\r\nHTTP\/(1\.0|1\.1) \\d+ \\w+/', $fullPage);
// matches patterns such as "\r\n\r\nHTTP/1.1 302 Found"
Which will work almost all the time, but chokes if someone has the following in their page:
...and for all you readers out there, here is an example HTTP header:
<PRE>
HTTP/1.1 200 OK
BALLS!
We really want the split to stop matching as soon as it encounters any \r\n\r\n that isn't immediately followed by HTTP/1.x. Is there a way to do this with PHP regexes? Even this solution can choke in the (quite rare) situation where someone puts an HTTP header right at the beginning of their content. Is there a way in cURL to get all of the returned pages as an array?
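To make the idea concrete, here is a rough sketch of the "stop at the first blank line that isn't followed by a status line" approach, done with a loop rather than a single regex. The function name splitHeaderBlocks is hypothetical (it is not part of cURL or PHP), and it assumes the raw string is exactly what curl_exec() returned with CURLOPT_HEADER enabled:

```php
<?php
/**
 * Hypothetical helper (not a cURL function): peel header blocks off the
 * front of a raw response, one at a time, for as long as the remaining
 * text still begins with an HTTP status line.
 *
 * Returns a two-element array: [list of header blocks, body].
 */
function splitHeaderBlocks(string $raw): array
{
    $blocks = [];
    $rest   = $raw;

    // \A anchors at the very start, so only a status line at the front counts.
    while (preg_match('/\AHTTP\/\d(?:\.\d)? \d{3}[^\r\n]*\r\n/', $rest) === 1) {
        // Cut at the first blank line only; later ones belong to what follows.
        $parts = explode("\r\n\r\n", $rest, 2);
        if (count($parts) < 2) {
            break; // no blank line left: treat the remainder as body
        }
        $blocks[] = $parts[0];
        $rest     = $parts[1];
    }

    return [$blocks, $rest];
}
```

Called as list($headers, $body) = splitHeaderBlocks($fullPage); this should give one array entry per response in the redirect chain, so cookies from the 302 can still be read. It stops at the first chunk that doesn't start with a status line, so the <PRE> example above is left alone, but it is still fooled by the rare page whose content begins with an HTTP status line. If that matters, curl_getinfo($ch, CURLINFO_HEADER_SIZE) reports the total size of all headers received, which may allow cutting with substr() instead of guessing.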