Dropping my lurker status to finally ask a question...

I need to know how I can improve on the performance of a PHP script that draws its data from XML files.

Some background:

  • I've already traced the bottleneck to the CPU, but I want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.

  • The reason I'm using XML to store the object data is that it needs to be accessible via a browser-based Flash interface, and we want to provide fast user access in that area. The project is still in its early stages, though, so if best practice would be to abandon XML altogether, that would be a good answer too.

  • Lots of data: we're currently planning for roughly 100k objects, albeit mostly small ones, and ALL of them must be loaded by the script, with perhaps a few rare exceptions. The data set will only grow with time.

  • Frequent runs: ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k runs an hour. This, coupled with the data size, makes performance optimization imperative.

  • We've already taken one optimization step: making several runs on the same data rather than reloading it for each run. It's still taking too long, though, and the runs should generally use "fresh" data that includes the modifications made by users.

+1  A: 

If the XML stays relatively static, you could cache it as a PHP array, something like this:

<xml><foo>bar</foo></xml>

is cached in a file as

<?php return array('foo' => 'bar');

It should be faster for PHP to just include the arrayified version of the XML.
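
For illustration, a minimal sketch of that idea (the file names and the staleness check are my own assumptions, not part of the answer) could look like this:

<?php
// Rebuild the cached PHP array only when the XML file is newer than
// the cache, then include the plain PHP copy on every run.
function load_objects($xmlFile, $cacheFile)
{
    if (!file_exists($cacheFile) || filemtime($xmlFile) > filemtime($cacheFile)) {
        $xml  = simplexml_load_file($xmlFile);
        $data = json_decode(json_encode($xml), true); // crude SimpleXML -> array
        file_put_contents($cacheFile, '<?php return ' . var_export($data, true) . ';');
    }
    return include $cacheFile; // a plain PHP array; opcode caches can keep it hot
}

$objects = load_objects('objects.xml', 'objects.cache.php');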

Jani Hartikainen
This is a good answer, but we're already doing that for several runs at once; the XML isn't expected to stay static for more than a few seconds, but we're allowing a few minutes' worth of changes to slip by across a few runs. After that, we have to pick up all the changes, which means recreating the array. Still very CPU intensive.
Polymeron
+3  A: 

Just to clarify: is the data you're loading coming from XML files for processing in its current state and is it being modified before being sent to the Flash application?

It looks like you'd be better off using a database to store your data and pushing out XML as needed, rather than reading it in as XML first; if building the XML files gets slow, you could cache them as they're generated to avoid redundant generation of the same file.
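
As a rough, hypothetical sketch of that approach (SQLite, the table layout and the cache path are assumptions, not from the question):

<?php
// Objects live in a database; XML is generated on demand and cached
// on disk so the same file isn't rebuilt twice.
$db = new PDO('sqlite:objects.db');

function object_xml(PDO $db, $id, $cacheDir = 'xml-cache')
{
    $cacheFile = "$cacheDir/object-" . (int) $id . ".xml";
    if (file_exists($cacheFile)) {
        return file_get_contents($cacheFile); // reuse the generated file
    }

    $stmt = $db->prepare('SELECT name, value FROM objects WHERE id = ?');
    $stmt->execute(array($id));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    $xml = new SimpleXMLElement('<object/>');
    $xml->addChild('name', $row['name']);
    $xml->addChild('value', $row['value']);

    $out = $xml->asXML();
    file_put_contents($cacheFile, $out); // cache until the object changes
    return $out;
}

The cache file would have to be deleted or rewritten whenever a user edits the corresponding object, so stale XML never reaches the Flash client.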

Mathew Hall
This is what I was going to suggest. +1
ceejayoz
Yes, 100k objects are better kept in an embedded database, or a dedicated one if you can access it; then you can generate just the bits of the XML that the client needs.
Mercer Traieste
To clarify: the Flash interface and the runs are completely separate, except that the runs modify some data which will eventually be displayable. But the runs are independent of whether or not the objects are being queried by users. The data coming from XML is in its current state; when sent to Flash, it isn't modified. The users, however, have the ability to make changes to loaded files via the interface. The question is: faster user access notwithstanding, does working with a DB speed up the *runs*? We're more concerned about that currently.
Polymeron
In the case of the actual runs, it seems you might be able to gain a performance increase from a database; the overhead of loading the data will be significantly lower than parsing the XML each time. At the very least this would reduce the cost of each run.
Mathew Hall
A: 

~1k/hour against 3600 seconds per hour is a run roughly every 3.6 seconds (let alone the ~14 runs per second that 50k/hour would mean)...

There are many questions. Some of them are:

  • Does your PHP script need to read/process all records of the data source on each single run? If not, what kind of subset does it need (~size, criteria, ...)?
  • Same question for the Flash application. Also, who's sending the data? The PHP script? A "direct" request for the complete, static XML file?
  • What operations are performed on the data source?
  • Do you need some kind of concurrency mechanism?
  • ...

And just because you want to deliver XML data to the Flash clients, it doesn't necessarily mean that you have to store XML data on the server. If, for example, the clients only need a tiny subset of the available records, it's probably a lot faster not to store the data as XML but as something better suited to speed and "searchability", and then create the XML output for the subset on the fly, perhaps assisted by some caching depending on what data the clients request and how much/how often the data changes.
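
For instance, a hedged sketch of that subset idea (the query, request parameter and schema are invented for illustration) could stream just the requested records as XML:

<?php
// Pull only the subset the client asked for and stream it as XML,
// instead of keeping one big static XML file around.
$db   = new PDO('sqlite:objects.db');
$stmt = $db->prepare('SELECT id, name FROM objects WHERE region = ? LIMIT 100');
$stmt->execute(array($_GET['region']));

header('Content-Type: text/xml');
$w = new XMLWriter();
$w->openURI('php://output');       // write straight to the response
$w->startDocument('1.0', 'UTF-8');
$w->startElement('objects');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $w->startElement('object');
    $w->writeAttribute('id', $row['id']);
    $w->writeElement('name', $row['name']);
    $w->endElement();
}
$w->endElement();
$w->endDocument();
$w->flush();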

edit: Let's assume that you really, really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on this model on each run (world tick). This way you at least wouldn't have to load the data on each tick. But such a process is usually written in something other than PHP.
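
A very rough CLI sketch of that tick loop; load_world(), apply_user_changes() and tick() are hypothetical placeholders the real project would have to supply:

<?php
// Load the full world model once, then simulate forever in memory.
$world = load_world();

while (true) {
    $start = microtime(true);
    apply_user_changes($world);  // fold in edits made via the interface
    tick($world);                // one simulation pass over all objects
    $spent = microtime(true) - $start;
    // 50k runs/hour is ~14 ticks/second, i.e. a ~72 ms budget per tick
    usleep((int) max(0, (0.072 - $spent) * 1e6));
}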

VolkerK
To clarify, the runs should work in the background, processing data that will eventually be displayed to the users. We'll need every single object's data for every single run. When users are viewing the interface, it calls specific XML files in order to know what to display. No need for concurrency mechanisms; we're OK on that front, I think. Searchability is all well and good for the users, but would using the DB be more efficient for the background runs? That's the current concern.
Polymeron
If it's a background process, why do you need to read/load the whole dataset repeatedly? If you say you have to, we probably have to believe you ;-) but many times when such a question is asked in PHP forums, it boils down to "no, you don't need an (almost) continuous simulation for that". Can you be more specific about the dataset and the operations you want to perform on each run?
VolkerK
Then I would try to get rid of the files, or at least of the repeated load operations: i.e., a continuously running process that a) does the simulation, b) accepts and serves requests for subsets of the data, and c) handles requests to modify the data. So instead of uploading a file (that is stored as a file on the server), this process would integrate the new data into its world model (and probably store it in a database as a backup).
VolkerK