Currently I'm using PHP to load multiple XML files from around the web (non-local) using simplexml_load_file(). As you can imagine, this is quite a clunky process and it slows page load significantly (7 seconds to load 7 files), and there may be even more files to load in the future. These files don't change often, but changes should be displayed on the page as soon as they are made.

One idea I had was to cache a version of each feed, along with the HTML output I generate from that feed, in my DB. Then, each time the user loads the page, each live feed would be compared with its cached copy; if they differ, I would run my existing code, generate the HTML, output it, and save it to the DB. If they are the same, I could simply output the cached HTML.
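
Roughly what I have in mind (just a sketch; the feed_cache table and buildHtmlFromFeed() stand in for my real schema and generation code):

// Assumes a MySQL table like:
// CREATE TABLE feed_cache (url VARCHAR(255) PRIMARY KEY, xml TEXT, html TEXT);
function renderFeed(PDO $db, $url){
    $xml = file_get_contents($url); // still fetches the live feed every time

    $stmt = $db->prepare("SELECT xml, html FROM feed_cache WHERE url = ?");
    $stmt->execute(array($url));
    $cached = $stmt->fetch(PDO::FETCH_ASSOC);

    if($cached && $cached['xml'] === $xml){
        return $cached['html']; // feed unchanged: output the cached HTML
    }

    $html = buildHtmlFromFeed($xml); // my existing generation code
    $stmt = $db->prepare("REPLACE INTO feed_cache (url, xml, html) VALUES (?, ?, ?)");
    $stmt->execute(array($url, $xml, $html));
    return $html;
}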

My two concerns with this are:

Security: If I am storing a copy of an XML file, could this pose a security threat, seeing as I don't control the content of that file?

Speed: The main goal here is to increase the speed of the overall page load. Would the process described above increase the speed, or would it just bog down the server with more to do? Thanks for your help!

+2  A: 

How about having a cron job crawl through every external XML source, say, hourly or quarter-hourly and update it if necessary?

It wouldn't be in 100% real time, but would take the load off your web page - that would always be using cached files. I don't think there is a reliable way of polling external sources for updates other than actually downloading the file (in theory, it should be possible to get the correct cache headers, but I wouldn't rely on them being configured correctly.)
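
As a sketch (the feed list, the cache path, and the schedule are all hypothetical), the cron script could be as simple as:

// fetch_feeds.php - run from cron, e.g. every 15 minutes:
// */15 * * * * php /path/to/fetch_feeds.php
$feeds = array(
    'http://example.com/feed1.xml',
    'http://example.com/feed2.xml',
);

foreach($feeds as $url){
    $xml = @file_get_contents($url);
    if($xml !== FALSE){
        // Store outside the web root (see the security note below).
        file_put_contents('/var/cache/feeds/' . md5($url) . '.xml', $xml);
    }
}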

Security: If I am storing a copy of an XML file, could this pose a security threat, seeing as I don't control the content of that file?

Hardly. To make totally sure, store the cached XML files outside the web root. Any threat that remains is then the same as if you were passing the stream through live.

Unicron
Would this still be realistic if there were possibly hundreds of files, all at different URLs? (You see, each user gets his own set of files, so there will be an awful lot of them.)
WillyG
@iMaster sure! You'll have a certain amount of data lying around on the server, but as long as it doesn't take up too much space, it shouldn't be a problem.
Unicron
Turns out, you were right about the accuracy of cache headers. I'm still hesitant about cron jobs, though, because it seems like page loads would be a lot worse for anyone visiting while the script is running, since the job would be processing the files for EVERY user in the database. Perhaps I'm just misunderstanding what you're saying. Regardless, could you post a link where I could read a little more about using PHP with cron jobs? Thanks!
WillyG
+1  A: 

One idea I had was to cache a version of each feed and the html output I generate from that feed in my DB. Then, each time the user loads the page, the feeds would be compared; if they are different I would run my existing code, generate the HTML, output it, and save it to the DB. However, if it is the same, I could simply output the cached HTML.

Rather than caching the XML file yourself, you should send If-None-Match or If-Modified-Since headers with the request. This way you can check whether the files have changed without necessarily downloading them.

This can be done by setting a stream context for libxml before running simplexml_load_file(). If the file hasn't changed, you'll get a 304 Not Modified response, which has no body, so simplexml_load_file() will fail and return FALSE.

You could also use stream_context_get_default to set the general stream context, then retrieve the XML file into a string with file_get_contents and pass it to simplexml_load_string().

Here's an example of the first way:

class CachedXml {
    public $element, $url;

    private $mod_date, $etag;

    public function __construct($url){
        $this->url = $url;
        $this->element = NULL;
        $this->mod_date = FALSE;
        $this->etag = FALSE;
    }

    public function updateXml(){
        // Send the validators from the last successful fetch, if we have any.
        if($this->mod_date || $this->etag){
            $opts = array(
                'http'=>array(
                    'header'=>"If-Modified-Since: $this->mod_date\r\n" .
                              "If-None-Match: $this->etag\r\n"
                )
            );
            $context = stream_context_create($opts);
            libxml_set_streams_context($context);
        }
        // A 304 Not Modified response has no body, so simplexml_load_file()
        // fails and we fall through to return FALSE.
        if($attempt = @simplexml_load_file($this->url)){
            $this->element = $attempt;
            // Note: get_headers() makes a second request just for the headers.
            $headers = get_headers($this->url, 1);
            $this->mod_date = isset($headers['Last-Modified']) ? $headers['Last-Modified'] : FALSE;
            $this->etag = isset($headers['ETag']) ? $headers['ETag'] : FALSE;
            return TRUE;
        }
        return FALSE;
    }
}

$bob = new CachedXml('http://example.com/xml/test.xml');

if($bob->updateXml()){
    echo "Bob was just updated.<br />";
    echo " Bob's name is " . $bob->element->getName() . ".<br />";
}
else{
    echo "Bob was not updated.<br />";
}
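
And a rough sketch of the second way (the URL and header values are placeholders; in practice you'd fill them in from the previous response):

// Set conditional-request headers on the default stream context,
// which file_get_contents() uses when no context is passed.
stream_context_get_default(array(
    'http'=>array(
        'header'=>"If-Modified-Since: Sat, 29 May 2010 12:00:00 GMT\r\n" .
                  "If-None-Match: \"abc123\"\r\n"
    )
));

$xml = @file_get_contents('http://example.com/xml/test.xml');
if($xml !== FALSE){
    $element = simplexml_load_string($xml);
    // ...use $element as before...
}
else{
    // 304 Not Modified (or an error): keep using the cached copy.
}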
GoalBased
I think I'm going to simplify this a bit and just use get_headers, then compare the Last-Modified value against the cache date. As far as I can tell, that gives me exactly what I want.
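Something like this (a sketch; $cachedAt is the timestamp I saved when I wrote the cache):

$headers = get_headers($url, 1);
$lastModified = isset($headers['Last-Modified']) ? strtotime($headers['Last-Modified']) : FALSE;

// Only re-fetch and rebuild if the server reports a newer version.
if($lastModified === FALSE || $lastModified > $cachedAt){
    // reload the feed and regenerate the cached HTML
}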
WillyG
Turns out the headers aren't that accurate...thanks for the suggestion, though!
WillyG