I'm working on selecting locations (city, state) out of a database. The problem is that the query is running a tad slow and I'm not sure how to speed it up. For example:

SELECT CONCAT_WS(', ', city, state) as location, AVG(latitude), AVG(longitude) 
FROM places
WHERE city='New York' AND state='NY'
GROUP BY location

There's going to be a CONCAT on the location regardless, because I want the database to return a pretty, concatenated version of the location (unless there's reason to do this in the code instead). For example, "New York, NY". In reality, a third column is sometimes thrown into the mix (zipcode). I'm running on MySQL.

What would be the best way to optimize this query?

Also, as a secondary question, would adding "DISTINCT" slow down the query in any way? For example:

SELECT DISTINCT CONCAT_WS(', ', city, state) as location, AVG(latitude), AVG(longitude) 
FROM places
WHERE city='New York' AND state='NY'
GROUP BY location

(I'm currently doing this, but in the process of asking this question I realized that DISTINCT is unnecessary given the GROUP BY clause. Still, I wonder whether it makes any difference at all, and whether it's worth rocking the boat just to speed up the query.)

Edit: There are already indexes on city, state, and zipcode, plus combinations thereof (city and zipcode together; state and zipcode each on their own).

A: 

One way to optimize a query like this is to index those columns, so the database can sort and group using a B-tree or hash. The string concatenation may add some overhead as well.

Daniel A. White
I've already got the columns indexed. Do you think it'd be better to concatenate the strings in the code rather than in the db?
+4  A: 

Create a composite index on (state, city) and rewrite your query as this:

SELECT  CONCAT_WS(', ', city, state) AS location, AVG(latitude), AVG(longitude) 
FROM    places
WHERE   state='NY'
        AND city='New York'
GROUP BY
        state, city

Note that for this particular query you can omit the GROUP BY clause entirely:

SELECT  'New York, NY' AS location, AVG(latitude), AVG(longitude) 
FROM    places
WHERE   state='NY'
        AND city='New York'

However, this query will still need it:

SELECT  CONCAT_WS(', ', city, state) AS location, AVG(latitude), AVG(longitude) 
FROM    places
WHERE   state='NY'
GROUP BY
        state, city
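
The composite index might be created like this (the index name is an assumption; the column list must lead with state to match the WHERE and GROUP BY above):

ALTER TABLE places ADD INDEX idx_state_city (state, city);

With this index in place, the WHERE clause can seek directly to the matching range, and the GROUP BY reads rows already in (state, city) order, avoiding a filesort.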
Quassnoi
When you GROUP BY multiple columns, will that have the same logical effect as GROUP BY location?
Yes, except for the really weird edge cases (like when `city = 'New York, NY'` and `state` is NULL, which concatenates to the same string as `city = 'New York', state = 'NY'`)
Quassnoi
Given the way my geocoder works, there's no way that a comma will end up in either state or city (it assumes that a "," is a hard separator between two tokens). So that sounds pretty good to me.
Doing this seems to have greatly increased my Handler_read_rnd_next... Is this a bad thing? I'm trying to figure out why changing the code would affect this variable.
@Roger: could you please build `EXPLAIN` for both queries (your original one and mine)?
Quassnoi
A: 

Adding an index on fields "city" and "state" will help.

Also, depending on the cardinality of each field (the number of distinct values), the MySQL version, the table engine, and other parameters, swapping the order of the conditions in the WHERE clause may affect the query's execution time. I'd try:

WHERE state='NY' AND city='New York'
Nicolas
+2  A: 

It's funny, but almost every problem people have with databases is speed, not storage requirements. That should tell you something :-)

We've had problems like this before and I've said this many times: per-row functions generally do not scale well. The best method we've found to fix them is with insert/update triggers (I'm assuming MySQL has these).

Create another column called pretty_city_state (or whatever) and have the triggers populate it from city and state whenever you insert or update a row. Then create an index on it.
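
In MySQL that setup might look like this sketch (the column size and the trigger/index names are assumptions):

ALTER TABLE places ADD COLUMN pretty_city_state VARCHAR(120);

DELIMITER $$
CREATE TRIGGER places_pretty_ins BEFORE INSERT ON places
FOR EACH ROW
  SET NEW.pretty_city_state = CONCAT_WS(', ', NEW.city, NEW.state)$$

CREATE TRIGGER places_pretty_upd BEFORE UPDATE ON places
FOR EACH ROW
  SET NEW.pretty_city_state = CONCAT_WS(', ', NEW.city, NEW.state)$$
DELIMITER ;

CREATE INDEX idx_pretty_city_state ON places (pretty_city_state);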

This takes advantage of the fact that database rows are generally read far more often than they're written (especially in this case). By evaluating that column on write, you amortize the cost across the writes (thousands) rather than the reads (probably millions). And the write is where the cost should be borne, simply because pretty_city_state only changes when either city or state changes. If you do the concat on every select, you're wasting effort.

Try that and measure the difference - I'm sure you'll find that your selects will scream along at a minimal cost for the triggers (and that cost disappears totally once you have all cities and states within your database).

And yes, I know this breaks 3NF. It's perfectly acceptable to do so for performance reasons if you know what you're doing.

Your query could be done as:

SELECT pretty_city_state as location, AVG(latitude), AVG(longitude) 
FROM places
WHERE city='New York' AND state='NY'
GROUP BY pretty_city_state

or, maybe even faster (measure, don't guess) if you can concatenate the city and state before starting the query:

SELECT pretty_city_state as location, AVG(latitude), AVG(longitude) 
FROM places
WHERE pretty_city_state ='New York, NY'
GROUP BY pretty_city_state
paxdiablo
The issue is that we sometimes change what is concatenated, depending on what is asked. Namely, if the user doesn't include a zipcode, we don't return a zipcode in our response; but if they do, we do. Do you think we should make multiple precomputed CONCAT columns?
Yes, definitely. Storage is cheap, CPU grunt is not. And once you've made the decision to violate 3NF for performance, you may as well go all the way :-)
paxdiablo
For bonus points (since I'm obviously somewhat of an SQL newbie), what's the UPDATE query one could do to add pretty_city_state to each row (supposing I've created this column already; I think I can handle that ;) ).
Something like "update places set pretty_city_state = concat_ws(', ',city,state);"
paxdiablo
But, unless your DBMS is too smart by half, just create the triggers then "update places set city = city;" and let the trigger take care of it.
paxdiablo
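
Putting those last comments together, the backfill might run like this sketch (after the column and triggers exist; either statement works):

-- Backfill existing rows directly:
UPDATE places SET pretty_city_state = CONCAT_WS(', ', city, state);

-- Or touch every row and let the BEFORE UPDATE trigger do the work:
UPDATE places SET city = city;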