I'm writing a semi-simple database wrapper class and want it to have a fetching method that operates automagically: it should prepare each distinct statement only the first time around, and just bind and execute the query on successive calls.

I guess the main question is: how does re-preparing the same MySQL statement work? Will PDO magically recognize the statement (so I don't have to) and skip the preparation?

If not, I'm planning to do this by generating a unique key for each different query and keeping the prepared statements in a private array in the database object, each under its unique key (see the sketch after the list below). I'm planning to obtain the array key in one of the following ways (none of which I like). In order of preference:

  • have the programmer pass an extra, always-the-same parameter when calling the method - something along the lines of basename(__FILE__, ".php") . __LINE__ (this would only work when the method is called repeatedly from the same line, i.e. within a loop - which is the case most of the time this functionality is needed)
  • have the programmer pass a totally random string (most likely generated beforehand) as an extra parameter
  • use the passed query itself to generate the key - getting the hash of the query or something similar
  • achieve the same as the first bullet (above) by calling debug_backtrace
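
For illustration, here is a minimal sketch of that idea (the class and method names are made up); it keys the private statement cache by the query text itself, which is a simpler variant of the third option above:

<?php
class Db
{
    private $pdo;
    private $statements = array(); // prepared statements, keyed by query text

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function fetch($query, array $params = array())
    {
        // Prepare only on the first call for this exact query text;
        // later calls just bind new values and execute the cached statement.
        if (!isset($this->statements[$query])) {
            $this->statements[$query] = $this->pdo->prepare($query);
        }

        $stmt = $this->statements[$query];
        $stmt->execute($params);

        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}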

Does anyone have similar experience? Although the system I'm working on does deserve some attention to optimization (it's quite large and growing by the week), perhaps I'm worrying about nothing and there is no performance benefit in doing what I'm doing?

A: 

To my knowledge, PDO does not reuse already prepared statements: it does not analyse the query by itself, so it does not know whether it is the same query.

If you want to create a cache of prepared queries, the simplest way imho would be to md5-hash the query string and generate a lookup table.

OTOH: how many queries are you executing per minute? If it's fewer than a few hundred, then you are only complicating the code; the performance gain will be minor.

dbemerlin
I may be missing something, but why create the md5 of the query string? Why not use the query string _itself_?
Inshallah
Because the query string itself might be quite long, a lookup on it would be slower. A hash allows a quicker lookup (of course, if the queries are short enough and the lookups few enough it might actually be slower, but in the usual case a hash search is faster).
dbemerlin
What? MD5ing the query string for performance? That's completely insane :-). I've seen the indexing by query string method used in Perl, to good effect. Are you sure you know enough about the way PHP arrays are implemented to make such a recommendation?
Inshallah
It _might_ be unnecessary for PHP as I haven't done any benchmarking for this specific case, but imho it is always a good idea to use hash values for lookups even if the array hashes internally, as you otherwise might be tempted to do something like $key = <long query>; foreach ($stuff as $item) { $cache[$key]->do_stuff($item); }, which will force PHP to rehash the query on each loop iteration to look up the key. OTOH I usually don't care much about how the language itself implements these things, so I might be wrong there: you should benchmark it and find out.
dbemerlin
+1  A: 

Believe me, I've done this before, and after building a cache of prepared statements the performance gain was very noticeable - see this question: Preparing SQL Statements with PDO.

And this is the code I came up with afterwards, with cached prepared statements:

function DB($query)
{
    static $db = null;
    static $result = array();

    if (is_null($db) === true)
    {
        // First call: $query is the path to the SQLite database and is only used to open the connection.
        $db = new PDO('sqlite:' . $query, null, null, array(PDO::ATTR_ERRMODE => PDO::ERRMODE_WARNING));
    }

    else if (is_a($db, 'PDO') === true)
    {
        // Subsequent calls: cache the prepared statement under the MD5 hash of the query text.
        $hash = md5($query);

        if (empty($result[$hash]) === true)
        {
            $result[$hash] = $db->prepare($query);
        }

        if (is_a($result[$hash], 'PDOStatement') === true)
        {
            // Any arguments after the query string are passed to execute() as bound values.
            if ($result[$hash]->execute(array_slice(func_get_args(), 1)) === true)
            {
                if (stripos($query, 'INSERT') === 0)
                {
                    return $db->lastInsertId();
                }

                else if (stripos($query, 'SELECT') === 0)
                {
                    return $result[$hash]->fetchAll(PDO::FETCH_ASSOC);
                }

                else if ((stripos($query, 'UPDATE') === 0) || (stripos($query, 'DELETE') === 0))
                {
                    return $result[$hash]->rowCount();
                }

                else if (stripos($query, 'REPLACE') === 0)
                {
                    // REPLACE falls through to the generic success return below.
                }

                return true;
            }
        }

        return false;
    }
}

Since I don't need to worry about collisions in queries, I've ended up using md5() instead of sha1().
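
For reference, this is roughly how the helper is meant to be called (the table and column names are made up); the very first call only opens the connection, and every later call prepares the query once, caches it and reuses it:

DB('/path/to/database.sqlite');                                   // first call: opens the SQLite connection
$rows = DB('SELECT * FROM users WHERE id = ?', 1);                // prepared once, cached by md5($query); returns fetchAll()
$id   = DB('INSERT INTO users (name) VALUES (?)', 'Bob');         // returns lastInsertId()
$n    = DB('UPDATE users SET name = ? WHERE id = ?', 'Alice', 1); // returns rowCount()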

Alix Axel
I'm very interested to know why you made the decision to use MD5 hashing. I've posted a naive benchmark of MD5 hashing vs PHP's native array hashing, and couldn't find any improvement with the MD5 method.
Inshallah
@Inshallah: The decision to use `md5()` was just so I wouldn't have new lines (`\n`) in queries.
Alix Axel
+2  A: 

MySQL (like most DBMS) will cache execution plans for prepared statements, so if user A creates a plan for:

SELECT * FROM some_table WHERE a_col=:v1 AND b_col=:v2

(where v1 and v2 are bind vars) and then sends the values to be interpolated by the DBMS, and user B subsequently sends the same query (but with different values for interpolation), the DBMS does not have to regenerate the plan. I.e. it's the DBMS which finds the matching plan - not PDO.

However, this means that each operation on the database requires at least two round trips (the first to present the query, the second to present the bind vars), as opposed to a single round trip for a query with literal values, so it introduces additional network cost. There is also a small cost involved in dereferencing (and maintaining) the query/plan cache.
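
To make the round-trip difference concrete, a hedged sketch (this assumes native server-side prepares, i.e. PDO::ATTR_EMULATE_PREPARES turned off, and that $pdo is an already open connection; table and values are made up):

// Prepared route: at least two server round trips (PREPARE, then EXECUTE).
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
$stmt = $pdo->prepare('SELECT * FROM some_table WHERE a_col = :v1 AND b_col = :v2');
$stmt->execute(array(':v1' => 1, ':v2' => 'x'));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Literal route: a single round trip, the values travel inside the query text.
$rows = $pdo->query("SELECT * FROM some_table WHERE a_col = 1 AND b_col = 'x'")
            ->fetchAll(PDO::FETCH_ASSOC);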

The key question is whether this cost is greater than the cost of generating the plan in the first place.

While (in my experience) there definitely seems to be a performance benefit to using prepared statements with Oracle, I'm not convinced that the same is true for MySQL - a lot will depend on the structure of your database and the complexity of the query (or, more specifically, on how many different options the optimizer can find for resolving the query).

Try measuring it yourself (hint: you might want to set the slow query threshold to 0 and write some code to convert literal values back into anonymous representations for the queries written to the logs).
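
A rough sketch of what that normalisation step might look like (the regular expressions are only illustrative and would need tuning against real log output):

<?php
// Turn literal values in logged queries back into anonymous placeholders so
// that identical query "shapes" can be grouped and counted.
function normalize_query($sql)
{
    $sql = preg_replace("/'[^']*'/", '?', $sql);          // quoted string literals (naive: ignores escaped quotes)
    $sql = preg_replace('/\b\d+(\.\d+)?\b/', '?', $sql);  // numeric literals
    return preg_replace('/\s+/', ' ', trim($sql));        // collapse whitespace
}

// Both of these normalise to: SELECT * FROM some_table WHERE a_col = ? AND b_col = ?
echo normalize_query("SELECT * FROM some_table WHERE a_col = 'foo' AND b_col = 42"), "\n";
echo normalize_query("SELECT * FROM some_table WHERE a_col = 'bar' AND b_col = 7"), "\n";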

C.

symcbean
+1  A: 

OK, since I've been bashing methods of keying the queries for the cache, other than simply using the query string itself, I've done a naive benchmark. The following compares using the plain query string vs first creating the md5 hash:

$ php -v
PHP 5.3.0-3 with Suhosin-Patch (cli) (built: Aug 26 2009 08:01:52)
...
$ php benchmark.php
PHP hashing: 0.19465494155884 (seconds)
MD5 hashing: 0.57781004905701 (seconds)
799994

The code:

<?php
error_reporting(E_ALL);

$queries = array("SELECT",
                 "INSERT",
                 "UPDATE",
                 "DELETE",
                 );
$query_length = 256;
$num_queries  = 256;
$iter = 10000;

for ($i = 0; $i < $num_queries; $i++) {
    $q = implode('',
           array_map("chr",
             array_map("rand",
                       array_fill(0, $query_length, ord("a")),
                       array_fill(0, $query_length, ord("z")))));
    $queries[] = $q;
}

echo count($queries), "\n";

// Benchmark 1: use the raw query string itself as the cache key.
$cache = array();
$side_effect1 = 0;
$t = microtime(true);
for ($i = 0; $i < $iter; $i++) {
    foreach ($queries as $q) {
        if (!isset($cache[$q])) {
            $cache[$q] = $q;
        }
        else {
            $side_effect1++;
        }
    }
}
echo microtime(true) - $t, "\n";

// Benchmark 2: md5() the query string and use the hash as the cache key.
$cache = array();
$side_effect2 = 0;
$t = microtime(true);
for ($i = 0; $i < $iter; $i++) {
    foreach ($queries as $q) {
        $md5 = md5($q);
        if (!isset($cache[$md5])) {
            $cache[$md5] = $q;
        }
        else {
            $side_effect2++;
        }
    }
}
echo microtime(true) - $t, "\n";

echo $side_effect1 + $side_effect2, "\n";
Inshallah
Thanks for the benchmark; sadly I couldn't get it to work here as $cache never had any data and $q was only "DELETEArray", but I will investigate this further as soon as I have the opportunity.
dbemerlin
@dbemerlin, yes, sorry. It should work now. I edited the code a bit after copy/pasting from emacs, introducing some errors in the process :-).
Inshallah
I tested it now, based on your script, with these modifications: create 100 random strings by concatenating, 20 times, a random string from $queries. For each of these random strings, loop $iter times and each time check if the string exists in $cache; if not, set it. For md5: for each of the random strings create the hash once and then loop $iter times doing the same. Result: with $iter=1 the md5 version is 0.00008ms slower; for low values of $iter (100 and lower) there is no real performance difference (<0.1s); with an increasing number of iterations md5 gains more of an advantage. So: it depends on the use case.
dbemerlin
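
A rough reconstruction of the variation dbemerlin describes (hypothetical; the numbers above come from his own version). The point of interest is that the md5 hash is computed once per string, outside the inner loop, so it can be reused across iterations:

<?php
// Build 100 long strings, each from 20 random picks out of $queries.
$queries = array("SELECT", "INSERT", "UPDATE", "DELETE");
$strings = array();
for ($i = 0; $i < 100; $i++) {
    $s = '';
    for ($j = 0; $j < 20; $j++) {
        $s .= $queries[array_rand($queries)];
    }
    $strings[] = $s;
}

$iter = 10000;

// Variant 1: key the cache by the raw string inside the inner loop.
$cache = array();
$t = microtime(true);
foreach ($strings as $s) {
    for ($i = 0; $i < $iter; $i++) {
        if (!isset($cache[$s])) {
            $cache[$s] = true;
        }
    }
}
echo 'string keys: ', microtime(true) - $t, "\n";

// Variant 2: compute the md5 hash once per string, then reuse it in the loop.
$cache = array();
$t = microtime(true);
foreach ($strings as $s) {
    $md5 = md5($s);
    for ($i = 0; $i < $iter; $i++) {
        if (!isset($cache[$md5])) {
            $cache[$md5] = true;
        }
    }
}
echo 'md5 keys:    ', microtime(true) - $t, "\n";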
I ran Inshallah's benchmark with varied `$length` and `$iter` values and it seems PHP's own hashing wins every time. I guess I'll be accepting his answer, as he was right in the end, having pointed out the final solution (use whole queries as keys in the cache) in the comments of the original question. When I finish the wrapper class, I'll try to post further benchmarks on whether it was worth caching statements at all (can't promise anything though :)
Raveren
@dbemerlin, thanks for your feedback. I pumped up the number of queries to 256 with some really long ones but, for me, MD5 consistently lost.
Inshallah
I'd recommend benchmarking with md5-strings too, as they might (or might not) be faster for your use case, especially if you have the possibility to reuse the hash (i.e. in a loop), and they might (or might not) be faster if you have more than the 4 different queries in Inshallah's benchmark.
dbemerlin
You guys are aware that associative arrays are implemented as /hashmaps/? Since the resulting md5 hash is then hashed again with whatever PHP uses internally to maintain its hashmaps, there can only be a performance gain in using md5 up front if a) PHP's internal hashing performs worse than the exposed md5 implementation (including all the overhead of calling the exposed md5 function, allocating enough memory on the heap for the resulting md5 hash, etc.) and b) the input to md5 is significantly larger than an md5 hash - no matter what (because the resulting md5 will be hashed again).
Daniel Baulig
@Daniel I have no knowledge whatsoever about how PHP implements associative arrays, but I thought it was something like that. :-)
Inshallah
A: 

Using an MD5 hash as a key, you could eventually get two queries that result in the same MD5 hash. The probability is not high, but it could happen. Don't do it. Lossy hashing algorithms like MD5 are only meant as a way to tell with high certainty whether two objects are different; they are not a safe means of identifying something.

Daniel Alvarez Arribas