Is there an appreciable performance difference between having one SELECT foo, bar FROM users query that returns 500 rows, and 500 SELECT foo, bar FROM users WHERE id = x queries coming all at once?

In a PHP application I'm writing, I'm trying to choose between writing a clear, readable section of code that would produce about 500 SELECT statements, or writing it in an obscure, complex way that would use only one SELECT that returns 500 rows.

I would prefer the way that uses clear, maintainable code, but I'm concerned that the connection overhead for each of the SELECTs will cause performance problems.

Background info, in case it's relevant:

1) This is a Drupal module, coded in PHP.

2) The tables in question get very few INSERTs and UPDATEs, and are rarely locked.

3) SQL JOINs aren't possible for reasons not relevant to the question.

Thanks!

+5  A: 

It's almost always faster to do one big batch SELECT and parse the results in your application code than to issue a massive number of single-row SELECTs. I would recommend that you implement both and profile them, though. Always strive to minimize the number of assumptions you have to make.
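If you do profile it, a minimal sketch of timing both approaches might look like this (assuming a PDO connection in $pdo and the 500 ids in $ids; all names here are illustrative):

// Approach 1: one single-row query per id.
$start = microtime(true);
$stmt = $pdo->prepare('SELECT foo, bar FROM users WHERE id = ?');
foreach ($ids as $id) {
    $stmt->execute(array($id));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
}
printf("500 single-row queries: %.4fs\n", microtime(true) - $start);

// Approach 2: one batched query returning all 500 rows.
$start = microtime(true);
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT foo, bar FROM users WHERE id IN ($placeholders)");
$stmt->execute(array_values($ids));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
printf("One batched query: %.4fs\n", microtime(true) - $start);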

Emil H
Thanks! I suspected as much. Could you give a sense of *why* it's better, though? For example, is it the connection overhead? I'd like a better understanding of exactly how one way is better than the other.
anschauung
CPU and memory are always less costly than any form of IO, including network IO. Think about it like this: it's definitely less expensive to send a single order for 500 books and receive them in one huge package in the mail than to send 500 orders and receive all the books in separate packages. :)
Emil H
+1  A: 

It seems like you know what the 500 id values are, so why not do something like this:

// Assuming you have already validated that this array contains only integers,
// so there is no risk of SQL injection

$ids = join(',', $arrayOfIds);

$sql = "SELECT `foo`, `bar` FROM `users` WHERE `id` IN ($ids)";
Doug Neiner
I don't know the IDs in this case, but that's a neat technique. I'll have to remember it.
anschauung
If you don't have ids, how can you run `SELECT foo, bar FROM users WHERE id = x`?
Doug Neiner
I see what you mean. What I meant to write was that I can *get* the ids, but the most straightforward way to do so is to get them one at a time. I don't have them all in one handy batch.
anschauung
So does that mean you are running 500x2 queries? 1 lookup for `id` and 1 SELECT with the `id` ?
Doug Neiner
At the root of it, yes. As a practical matter, I'm plugging into a larger system that has the data arranged in a different format than I need for this particular operation. The larger application would have looked up the id anyway, so I'm not counting that as a second query for the narrow purposes of this question.
anschauung
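If the ids really do arrive one at a time, one way to keep the batched SELECT while staying readable is to split the work into two passes, as in this rough sketch (lookup_id() is a hypothetical stand-in for whatever per-item lookup the larger application already performs):

// Pass 1: collect the ids up front instead of querying inside the loop.
$ids = array();
foreach ($items as $item) {
    $ids[] = lookup_id($item);
}
// Pass 2: feed $ids into the single IN (...) query from the answer above.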
A list of 500 ids is not very big. I've created `where id in ( $list )` predicates having tens of thousands of items in the list.
memnoch_proxy
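If the list ever does grow that large, a cautious variant is to chunk it so no single statement gets unwieldy (the batch size of 1000 here is arbitrary; assumes a PDO connection in $pdo):

// Run the IN (...) query in batches of 1000 ids per statement.
$rows = array();
foreach (array_chunk($allIds, 1000) as $chunk) {
    $in = implode(',', array_map('intval', $chunk)); // force integers
    $result = $pdo->query("SELECT foo, bar FROM users WHERE id IN ($in)");
    $rows = array_merge($rows, $result->fetchAll(PDO::FETCH_ASSOC));
}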
+2  A: 

I would not worry too much about the connection overhead of MySQL queries, especially if you are not closing the connection between every query. Consider that if your query creates a temporary table, you've already spent more time in the query itself than the connection overhead cost.
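If per-query connection cost is the main worry, a persistent connection amortizes it across requests; a minimal PDO sketch (the DSN and credentials are placeholders):

// PDO::ATTR_PERSISTENT reuses the underlying connection between
// requests instead of reconnecting each time.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'secret',
               array(PDO::ATTR_PERSISTENT => true));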

I love doing a complex SQL query, personally, but I have found that table size, the MySQL query cache, and the performance of queries that need to do range checking (even against an index) all make a difference.

I suggest this:

1) Establish the simple, correct baseline. I suspect this is the zillion-query approach. This is not wrong, and very likely correct. Run it a few times and watch your query cache and application performance. The ability to keep your app maintainable is very important, especially if you work with other code maintainers. Also, if you're querying really large tables, small queries will maintain scalability.

2) Code the complex query. Compare the results for accuracy, and then the time. Then use EXPLAIN on the query to see how many rows are scanned (see the sketch after this list). I have often found that if I have a JOIN, or a WHERE x != y, or a condition that creates a temporary table, query performance can get pretty bad, especially on a table that's constantly being updated. However, I've also found that a complex query might not be correct, and that a complex query can more easily break as an application grows. Complex queries typically scan larger sets of rows, often creating temporary tables and invoking `Using where` scans. The larger the table, the more expensive these get. Also, you might have team considerations where complex queries don't suit your team's strengths.

3) Share the results with your team.
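As a sketch of step 2, prefixing the candidate query with EXPLAIN shows the rows scanned and whether MySQL resorts to a full scan or a temporary table (the query and ids here are illustrative, and $pdo is an open PDO connection):

// Inspect the plan: watch the type, rows, and Extra columns for
// full scans, "Using temporary", or "Using filesort".
$plan = $pdo->query('EXPLAIN SELECT foo, bar FROM users WHERE id IN (1, 2, 3)');
foreach ($plan->fetchAll(PDO::FETCH_ASSOC) as $step) {
    printf("type=%s rows=%s extra=%s\n", $step['type'], $step['rows'], $step['Extra']);
}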

Complex queries are less likely to hit the MySQL query cache, and if they are large enough, don't cache them. (You want to save the MySQL query cache for frequently hit queries.) Also, WHERE predicates that have to scan the index will not do as well (x != y, x > y, x < y). Queries like SELECT foo, bar FROM users WHERE foo != 'g' AND mumble < '360' end up doing scans. (The cost of query overhead could be negligible in that case.)

Small queries can often complete without creating temporary tables, just by getting all values from the index, as long as the fields you're selecting and predicating on are indexed. So the query performance of SELECT foo, bar FROM users WHERE id = x is really great (especially if columns foo and bar are covered by an index, e.g. ALTER TABLE users ADD INDEX ix_a (foo, bar);).

Other good ways to increase performance in your application would be to cache those small query results in the application (if appropriate), or to run batch jobs that populate a materialized-view-style table. Also, consider memcached or some of the features found in XCache.
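As one sketch of the application-side caching idea, using the PECL Memcached extension (the server address, key scheme, and 5-minute TTL are all illustrative):

// Serve repeated per-id lookups from memcached; fall back to the
// database on a miss and cache the result for next time.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

function get_user_row($pdo, $cache, $id) {
    $key = "users:foo_bar:$id";
    $row = $cache->get($key);
    if ($row === false) {
        $stmt = $pdo->prepare('SELECT foo, bar FROM users WHERE id = ?');
        $stmt->execute(array($id));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        $cache->set($key, $row, 300);
    }
    return $row;
}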

memnoch_proxy