I'm trying to use squid to modify the page content of web page requests. I followed the upside-down-ternet tutorial which showed instructions for how to flip images on pages.

I need to change the actual html of the page. I've been trying to do the same thing as in the tutorial, but instead of editing the image I'm trying to edit the html page. Below is a php script I'm using to try to do it.

All jpg images get flipped, but the content on the page does not get edited. The edited index.html files written contain the edited content, but the pages the users receive don't contain the edited content.

#!/usr/bin/php
<?php
$temp = array();
while ( $input = fgets(STDIN) ) {
    $micro_time = microtime();

    // Split the output (space delimited) from squid into an array.
    $temp = explode(' ', $input);

    //Flip jpg images, this works correctly
    if (preg_match("/.*\.jpg/i", $temp[0])) {
        system("/usr/bin/wget -q -O /var/www/cache/$micro_time.jpg ". $temp[0]);
        system("/usr/bin/mogrify -flip /var/www/cache/$micro_time.jpg");
        echo "http://127.0.0.1/cache/$micro_time.jpg\n";
    }

    //Don't edit files that are obviously not html. $temp[0] contains url of file to get
    elseif (preg_match("/(jpg|png|gif|css|js|\(|\))/i", $temp[0], $matches)) {
        echo $input;
    }   

    //Otherwise, could be html (e.g. `wget http://www.google.com` downloads index.html)
    else{ 
        $time = time() . microtime();       //For unique directory names
        $time = preg_replace("/ /", "", $time); //Simplify things by removing the spaces
        mkdir("/var/www/cache/". $time);    //Create unique folder
        system("/usr/bin/wget -q --directory-prefix=\"/var/www/cache/$time/\" ". $temp[0]);
        $filename = system("ls /var/www/cache/$time/");     //Get filename of downloaded file

        //File is html, edit the content (this does not work)
        if(preg_match("/.*\.html/", $filename)){

            //Get the html file contents  
            $contentfh = fopen("/var/www/cache/$time/". $filename, 'r');
            $content = fread($contentfh, filesize("/var/www/cache/$time/". $filename));
            fclose($contentfh);

            //Edit the html file contents
            $content = preg_replace("/<\/body>/i", "<!-- content served by proxy --></body>", $content);

            //Write the edited file
            $contentfh = fopen("/var/www/cache/$time/". $filename, 'w');
            fwrite($contentfh, $content);
            fclose($contentfh);

            //Return the edited page
            echo "http://127.0.0.1/cache/$time/$filename\n";
        }               
        //Otherwise file is not html, don't edit
        else{
            echo $input;
        }
    }
}
?>
A: 

Take a look at Dansguardian; it uses PCRE to modify content on the fly: link (look at the last 2 topics)

Ch4m3l3on
A: 

Not sure if it's the cause of your problem, but there's quite a lot wrong with the code.

You separate requests based on microtime. This only works reliably at relatively low traffic volumes, because two requests handled at nearly the same moment can end up with the same name. Note that the original (Perl) code may still break if more than one instance of the redirector is running.
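
One way to avoid relying on the clock alone is to combine the process ID with some random bytes. This is only a sketch: `unique_cache_name()` is a hypothetical helper, not something from the tutorial, and it assumes PHP 7 or later for `random_bytes()`.

```php
<?php
// Hypothetical helper: builds a cache file name that does not depend on the clock.
function unique_cache_name(string $extension): string
{
    // The process ID separates concurrent redirector instances;
    // the random suffix separates requests handled by the same instance.
    return sprintf('%d-%s.%s', getmypid(), bin2hex(random_bytes(8)), $extension);
}
```

The jpg branch could then write to `/var/www/cache/` followed by `unique_cache_name('jpg')` instead of `$micro_time.jpg`.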

You've tried to identify the content type from the file extension. That works for URLs which match your list, but it doesn't follow that anything which doesn't match must be text/html. Really you should check the MIME type returned by the origin server.
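
As a rough illustration of checking the MIME type, the function below decides whether a Content-Type header value denotes HTML. `is_html_content_type()` is a hypothetical name, and how the header is obtained is left open here (for example from PHP's `get_headers()` or from wget's `--server-response` output); treating `application/xhtml+xml` as HTML is an assumption as well.

```php
<?php
// Hypothetical helper: true when a Content-Type header value denotes HTML.
function is_html_content_type(string $contentType): bool
{
    // Drop parameters such as "; charset=UTF-8", then compare the media type.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return $mediaType === 'text/html' || $mediaType === 'application/xhtml+xml';
}
```

With a check like this, the final `else` branch would only rewrite the page when the server actually reports HTML, and would echo `$input` unchanged otherwise.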

You've got no error checking or debugging in the code. Although you don't have an error stream you can easily write to, you could write errors to a file or to the syslog, or fire off an email, whenever the fopen/fread calls fail or the stored file doesn't have a .html extension.
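
A minimal sketch of the file-logging option follows. Both helper names and the log file location are assumptions made for illustration. The point is that diagnostics go to a file rather than to standard output, which squid reads as the redirector's reply.

```php
<?php
// Hypothetical helper: append a timestamped line to a log file.
function log_redirector_error(string $message, string $logFile): bool
{
    // Message type 3 appends to the given file; error_log() adds no newline itself.
    return error_log(date('c') . ' ' . $message . "\n", 3, $logFile);
}

// Hypothetical helper: read a cached file, logging a failure instead of ignoring it.
function read_cached_file(string $path, string $logFile): ?string
{
    $content = @file_get_contents($path);
    if ($content === false) {
        log_redirector_error("could not read $path", $logFile);
        return null;
    }
    return $content;
}
```

The same logger could record the case where the downloaded file has no .html extension. It may also be worth checking what else reaches standard output: PHP's `system()` passes the command's output straight through, whereas `exec()` returns it without printing, so the `ls` call in the question could be a candidate for that kind of check.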

C.

symcbean