Hi,

I'm getting strange behavior when fetching a web page over a raw socket. The string returned by the get_content() function below includes some "extra information" that is not present in the original page.

function get_content($a, $b, $c = "00")
{
    $request  = "arg01="  . $a;
    $request .= "&arg02="  . $b;
    $request .= "&arg03="  . $c;

    $host = "www.site.com";
    $script = "/page.php";
    $method = "POST";

    $request_length = strlen($request);

    $header = "$method $script HTTP/1.1\r\n";
    $header .= "Host: $host\r\n";
    $header .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $header .= "Content-Length: $request_length\r\n";
    $header .= "Connection: close\r\n\r\n";
    $header .= $request; // the body must be exactly Content-Length bytes, so no trailing CRLF

    $output = ""; // initialize so the function never returns an undefined variable

    $socket = @fsockopen($host, 80, $errno, $errstr);
    if ($socket) {
        fputs($socket, $header);
        while (!feof($socket)) {
            $output .= fgets($socket);
        }
        fclose($socket);
    }

    return $output;
}

Printing $output:

HTTP/1.1 200 OK
Date: Fri, 24 Jul 2009 15:20:38 GMT
Server: Apache/2.2.8 (Unix) PHP/4.4.8
X-Powered-By: PHP/4.4.8
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

1f61
<html>
<head>
    <title>

    (...) html here (...)
            <td align='right'><font size='-1'>   18,65</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   24,10</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   18,40</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   24,10</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   24,10</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   18,65</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>
f43
   24,10</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   18,65</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   18,65</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   24,10</font></td>
    (...) html here (...)
            <td align='right'><font size='-1'>   18,40</font></td>

    (...) html here (...)
   </body>
</html>

0

Note the block below:

            <td align='right'><font size='-1'>
f43
   24,10</font></td>

It is not present in the original HTML, which should read:

            <td align='right'><font size='-1'>   24,10</font></td>

just like the other td tags.

To fix this issue, I replaced the socket code with cURL:

function get_content($a, $b, $c = "00")
{
    $args  = "arg01="  . $a;
    $args .= "&arg02="  . $b;
    $args .= "&arg03="  . $c;

    $host = "http://www.site.com/page.php";

    $ch = curl_init($host);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $args);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // With CURLOPT_RETURNTRANSFER set, curl_exec() returns the response body.
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

Although cURL fixes the problem, I'd like to understand what is happening here. Where does this "extra information" come from? A wormhole or something like that?

Do you have any thoughts?

Thanks!

+5  A: 

The server you are fetching the content from sends the data using chunked transfer encoding. You can tell from this header in the response:

Transfer-Encoding: chunked

Chunked transfer encoding works this way: the server sends a hexadecimal number (in ASCII characters) giving the length of the next chunk, then a CRLF (\r\n), then the chunk data, then another CRLF, and then it starts over with the next chunk. A final chunk of length 0 marks the end of the body (that is the lone "0" at the end of your output). Example:

10
1234567890abcdef
0C
qwertyuiopas

cURL decodes this for you, but you are reading the raw socket data, so the chunk lengths show up in the content you retrieve.

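So the additional "f43" you noticed, which was not in the original HTML, is actually the length of the following chunk in hexadecimal (0xf43 = 3907 bytes). The "1f61" right after the headers is the same thing: the length of the first chunk (0x1f61 = 8033 bytes).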
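To make the format above concrete, here is a minimal sketch of how a chunked body could be decoded by hand. The function name decode_chunked_body() is hypothetical (it is not a PHP built-in and not from the question), and the sketch assumes the whole body is already in memory, with the response headers stripped and no trailer fields after the last chunk.

```php
<?php
// Hypothetical helper (not a PHP built-in): decodes a chunked HTTP body
// that has already been separated from the response headers.
function decode_chunked_body($body)
{
    $decoded = "";
    $offset  = 0;
    $length  = strlen($body);

    while ($offset < $length) {
        // The chunk-size line ends at the next CRLF.
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // truncated or malformed input
        }

        // Drop optional chunk extensions after ';' and parse the hex size.
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $parts    = explode(";", $sizeLine, 2);
        $size     = (int) hexdec(trim($parts[0]));

        if ($size === 0) {
            break; // a zero-length chunk marks the end of the body
        }

        // Copy exactly $size bytes of chunk data.
        $decoded .= substr($body, $lineEnd + 2, $size);

        // Skip the size line's CRLF, the data, and the CRLF after the data.
        $offset = $lineEnd + 2 + $size + 2;
    }

    return $decoded;
}
```

With the question's get_content(), the body would first have to be separated from the headers, for example by splitting $output on the first "\r\n\r\n", before passing it to a function like this.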

Anyway, it's a good idea to use cURL or another HTTP library, because implementing a protocol with all its subtleties (like chunked transfer encoding in HTTP) is a lot of work: much more than twice the work of implementing a basic protocol handler that only covers one simple case.
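If a raw socket really has to stay, one possible workaround (not something proposed above, just a sketch) is to send the request as HTTP/1.0 instead of HTTP/1.1. Chunked transfer encoding is an HTTP/1.1 feature, so a compliant server should not use it when answering an HTTP/1.0 request. The helper name build_http10_post() is hypothetical; the host and path in the usage note are the placeholders from the question.

```php
<?php
// Hypothetical helper: builds a raw HTTP/1.0 POST request string that can be
// written to a socket opened with fsockopen().
function build_http10_post($host, $path, array $fields)
{
    // http_build_query() URL-encodes the keys and values.
    $body = http_build_query($fields, "", "&");

    $request  = "POST $path HTTP/1.0\r\n";
    $request .= "Host: $host\r\n";
    $request .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $request .= "Content-Length: " . strlen($body) . "\r\n";
    $request .= "Connection: close\r\n\r\n";
    $request .= $body; // no trailing CRLF: the body is exactly Content-Length bytes

    return $request;
}
```

The result of build_http10_post("www.site.com", "/page.php", array("arg01" => $a, "arg02" => $b, "arg03" => $c)) could be passed to fputs() in place of $header. This only sidesteps chunking; redirects, compression and the other cases a library covers are still unhandled.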

FWH
To complicate matters further, our recent upgrade of PHP (5.2.9, IIRC) introduced a problem wherein fpassthru() no longer handled chunked data: it started passing the chunk lengths on to the browser. I had to forgo fpassthru() and dechunk the response myself before echoing each chunk.
grantwparks
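For illustration, dechunking a response while echoing it chunk by chunk, as described in the comment above, might look roughly like the sketch below. This is not the commenter's actual code. The function name echo_dechunked() is hypothetical, and the sketch assumes $stream is positioned just after the response headers.

```php
<?php
// Hypothetical helper: reads a chunked body from an open stream and echoes
// each chunk's data as it arrives, without the chunk-size lines.
function echo_dechunked($stream)
{
    while (($line = fgets($stream)) !== false) {
        // Chunk-size line: hex digits, optionally followed by ";extension".
        $parts = explode(";", $line, 2);
        $size  = (int) hexdec(trim($parts[0]));
        if ($size === 0) {
            break; // zero-length chunk: end of body
        }

        $remaining = $size;
        while ($remaining > 0) {
            // fread() may return fewer bytes than requested on network streams.
            $data = fread($stream, $remaining);
            if ($data === false || $data === "") {
                break 2; // connection ended in the middle of a chunk
            }
            echo $data;
            $remaining -= strlen($data);
        }

        fgets($stream); // consume the CRLF that follows the chunk data
    }
}
```

Unlike decoding a complete string, this never holds more than one read's worth of data in memory, which matters when the response is being passed straight through to a browser.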