It seems that all the questions on this topic are very specific, and while I value specific examples, I'm interested in the basics of SQL optimization. I am very comfortable working in SQL and have a background in hardware and low-level software.

What I want are tools, both tangible software and a method of thinking, so that when I look at the MySQL databases I work with regularly, I understand what difference the ordering of JOIN and WHERE clauses makes.

I want to know why an index helps, and I mean exactly why. I want to know specifically what happens differently, and how I can actually look at what is happening. I don't need a tool that will break down every step of my SQL; I just want to be able to poke around, and if someone can't tell me what column to index, I should be able to get out a sheet of paper and, within some period of time, come up with the answer myself.

Databases are complicated, but they aren't THAT complicated, and there must be some great material out there for learning the basics, so that you know how to find the answers to the optimization problems you encounter, even if you could hunt down the exact answer on a forum.

Please recommend some reading that is concise, intuitive, and not afraid to get down to the low-level nuts and bolts. I prefer free online resources, but if a book recommendation really hits the nail on the head, I'd consider accepting it.

+6  A: 

Let's say you're looking for a friend in another city. One way would be to go from door to door and ask whether this is the house you're looking for. Another way is to look at the map.

The index is the map to a table. It can tell the DB engine exactly where the thing you're looking for is. Thus, you index every column that you think you will have to search for, and leave out the columns that you are just reading data from, and never searching for.

The MySQL manual has good technical reading about indexes and about ORDER BY optimization. And if you want to see exactly what is happening, you want the EXPLAIN statement.
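For instance, a minimal sketch (the table, column, and index names here are made up):

CREATE INDEX idx_name ON customer (name);

EXPLAIN SELECT * FROM customer WHERE name = 'Smith';
-- with the index: type=ref, key=idx_name (an index lookup)
-- without it: type=ALL (a full table scan); exact output varies by version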

Amadan
Also, the MySQL slow query log is worth watching. http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
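A minimal sketch of switching it on at runtime (MySQL 5.1+; on 5.0 you set log-slow-queries in my.cnf instead):

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;  -- log anything slower than 2 seconds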
Pete
I'm particularly interested in how indexes affect joins; I use joins a lot and don't really understand how they work at a low level. For example, does it matter if the two indexed columns you join on are very large? How is space for a join allocated and traversed? What if both columns are indexed? What if neither is?
walnutmon
Basically, the whole of chapter 7.2 of the MySQL manual is interesting. If a column is not indexed, you need up to n comparisons to find something; if it is indexed, you need at most log(n) comparisons. The length of the datum is a factor, but the index matters more. However, I find I almost never join on non-integer fields. My policy is: if a value has a non-trivial chance of repeating itself, it should have its own table and a primary key. And "what if" questions like yours are best answered by building the model and running `EXPLAIN` on sample queries.
Amadan
To expand a bit: a join is normally a double search, so everything that applies to searching applies doubly to joins. Whatever you join on had better be indexed.
Amadan
I've not yet had a chance to look through the documentation you've provided; I'll update when I do.
walnutmon
+2  A: 

Don't think about optimizing databases. Think about optimizing queries.

Generally, you optimize one case at the expense of others. You just have to decide which cases you're interested in.

harpo
+1  A: 

"I'm interested particularly in how indices will affect joins"

As an example, I'll take the case of an equijoin (SELECT * FROM A, B WHERE A.x = B.y).

If there are no indexes at all (which is possible in theory, though I think not in SQL in practice), then basically the only way to compute the join is to take the entire table A and partition it over x, take the entire table B and partition it over y, then match the partitions, and finally, for each pair of matching partitions, compute the result rows. That's costly (or even outright impossible due to memory restrictions) for all but the smallest tables.

The same story applies if indexes do exist on A and/or B, but none of them has x (respectively y) as its first attribute.

If there does exist an index on x but not on y (or conversely), another possibility opens up: scan table B, and for each row take the value of y, look that value up in the index, and fetch the corresponding A rows to compute the join. Note that this still won't win you much if no further restrictions apply (AND z = ...), except in the case where there are only a few matches between x and y values.

If ordered indexes exist on both x and y (hash-based indexes are not ordered), then a third possibility opens up: do a matching scan over the indexes themselves (the indexes are likely to be smaller than the tables, so scanning them takes less time), and for the matching x/y values, compute the join of the corresponding rows.

That's the baseline. Variations arise for joins on x > y, etc.
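If you want to watch the optimizer choose among these strategies, one way is to build a toy model and ask it directly; a sketch in MySQL (the names are just the A, B, x, y from above, and you would want representative data loaded before trusting the plan):

CREATE TABLE A (x INT, KEY (x));
CREATE TABLE B (y INT);

EXPLAIN SELECT * FROM A, B WHERE A.x = B.y;
-- expect a full scan of B (type=ALL) driving index lookups into A (type=ref);
-- drop the index on x and the plan degenerates toward the partition-and-match case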

Erwin Smout
+1  A: 

I don't know about MySQL tools, but MS SQL Server has a tool that shows all the operations a query performs and what share of the whole query's processing time each one takes.

Using this tool helped me understand how queries are handled by the query optimizer much better than I think any book could, because what the optimizer does is often not easy to understand. By tweaking the query, and possibly the underlying database, I could see how each change affected the query plan. There are certain key points in writing queries, but it sounds like you already have a grasp of those, so optimization in your case is much more about this kind of inspection than about general rules. After a few years of DB development I did look at a few books specifically aimed at database optimization on SQL Server and found very little useful information in them.

Quick googling came up with this: http://www.mysql.com/products/enterprise/query.html which sounds like a similar tool.

This was, of course, at the query level; database-level optimization is a different kettle of fish, where you look at parameters such as how your database is laid out across hard drives. In SQL Server, at least, you can place tables on different HDDs, and even different disk platters, which can have a big effect because the drives and drive heads can then work in parallel. Another factor is how you build your queries so that the database can run them across several threads and processors in parallel, but both of these depend on the database engine, and even the version, you are using.
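In MySQL, the rough equivalent of placing a table on different drives is the DATA DIRECTORY / INDEX DIRECTORY table options for MyISAM tables; a sketch with made-up paths:

CREATE TABLE big_log (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    msg TEXT
) ENGINE=MyISAM
  DATA DIRECTORY = '/disk2/mysql-data'
  INDEX DIRECTORY = '/disk3/mysql-index';
-- the data file and the index file then sit on separate spindles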

Makis
+3  A: 

You have to do a lookup for every WHERE condition and for every JOIN ... ON condition. The two work the same way.

Suppose we write

select name
from customer
where customerid=37;

Somehow the DBMS has to find the record or records with customerid=37. If there is no index, the only way to do this is to read every record in the table comparing the customerid to 37. Even when it finds one, it has no way of knowing there is only one, so it has to keep looking for others.

If you create an index on customerid, the DBMS can search the index very quickly. It's not a sequential search but, depending on the database, a binary search or some other efficient method. Exactly how doesn't matter; just accept that it's much faster than a sequential scan. The index then takes it directly to the appropriate record or records. Furthermore, if you declare the index "unique", the database knows there can be only one match, so it doesn't waste time looking for a second. (And the DBMS will prevent you from adding a second.)
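For example, this is the index that makes that lookup fast (the index name is arbitrary):

CREATE UNIQUE INDEX customer_id_idx ON customer (customerid);

EXPLAIN SELECT name FROM customer WHERE customerid = 37;
-- in MySQL: type=const with a unique index (one row, stop immediately),
-- type=ref with a non-unique index, type=ALL with no index at all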

Now consider this query:

select name
from customer
where city='Albany' and state='NY';

Now we have two conditions. If you have an index on only one of those fields, the DBMS will use that index to find a subset of the records, then sequentially search those. For example, if you have an index on state, the DBMS will quickly find the first record for NY, then sequentially search looking for city='Albany', and stop looking when it reaches the last record for NY.

If you have an index that includes both fields, e.g. CREATE INDEX customer_state_city ON customer (state, city), then the DBMS can zoom immediately to the right records.

If you have two separate indexes, one on each field, the DBMS has various rules it applies to decide which index to use. Again, exactly how this is done depends on the particular DBMS, but basically it keeps statistics on the total number of records, the number of distinct values, and the distribution of those values. It uses one index to narrow the candidates, then searches those records sequentially for the ones that satisfy the other condition. In this case the DBMS would probably observe that there are many more cities than states, so by using the city index it can quickly zoom to the 'Albany' records. It then searches these sequentially, checking the state of each against 'NY'. If you have records for Albany, California, those will be skipped.
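You can watch the optimizer make this choice. A sketch (the index names are made up):

CREATE INDEX customer_state ON customer (state);
CREATE INDEX customer_city ON customer (city);

EXPLAIN SELECT name FROM customer WHERE city = 'Albany' AND state = 'NY';
-- the key column shows which single index won; swap the pair for the
-- composite (state, city) index and key_len grows to cover both columns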

Every join requires some sort of look-up.

Say we write

select customer.name
from transaction
join customer on transaction.customerid=customer.customerid
where transaction.transactiondate='2010-07-04' and customer.type='Q';

Now the DBMS has to decide which table to read first, select the appropriate records from there, and then find the matching records in the other table.

If you had an index on transaction.transactiondate and customer.customerid, the best plan would likely be to find all the transactions with this date, and then for each of those find the customer with the matching customerid, and then verify that the customer has the right type.

If you don't have an index on customer.customerid, the DBMS can still quickly find the transactions for that date, but then for each transaction it has to sequentially search the customer table looking for a matching customerid. (This would likely be very slow.)

Suppose instead that the only indexes you have are on transaction.customerid and customer.type. Then the DBMS would likely use a completely different plan. It would probably scan the customer table for all customers with the correct type, then for each of these find all transactions for this customer, and sequentially search them for the right date.
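The two plans above correspond to two different sets of supporting indexes; a sketch of each (the index names are arbitrary):

-- supports the "transactions first" plan: find the date, then probe customers
CREATE INDEX trans_date_idx ON transaction (transactiondate);
CREATE INDEX cust_id_idx ON customer (customerid);

-- supports the "customers first" plan: find type 'Q', then probe transactions
CREATE INDEX cust_type_idx ON customer (type);
CREATE INDEX trans_cust_idx ON transaction (customerid);

EXPLAIN SELECT customer.name
FROM transaction
JOIN customer ON transaction.customerid = customer.customerid
WHERE transaction.transactiondate = '2010-07-04' AND customer.type = 'Q';
-- the table listed first in EXPLAIN's output is the one the DBMS reads first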

The most important key to optimization is to figure out what indexes will really help and create those indexes. Extra, unused indexes are a burden on the database because it takes work to maintain them, and if they're never used this is wasted effort.

You can tell what indexes the DBMS will use for any given query with the EXPLAIN command. I use this all the time to determine if my queries are being optimized well or if I should be creating additional indexes. (Read the documentation on this command for an explanation of its output.)

Caveat: Remember that I said the DBMS keeps statistics on the number of records, the number of distinct values, and so on, for each table. EXPLAIN may give you a completely different plan today than it gave yesterday if the data has changed. For example, if you have a query that joins two tables and one of them is very small while the other is large, the optimizer will be biased toward reading the small table first and then finding matching records in the large one. Adding records can change which table is larger, and thus lead the DBMS to change its plan. Thus, you should run EXPLAIN against a database with realistic data: running it against a test database with 5 records in each table is of far less value than running it against a live database.
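One practical consequence: when a table's contents change substantially, the statistics can go stale. In MySQL you can refresh them explicitly:

ANALYZE TABLE customer;
ANALYZE TABLE transaction;
-- recomputes the key-distribution statistics the optimizer relies on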

Well, there's much more that could be said, but I don't want to write a book here.

Jay
Wow, that's a lot of information; thanks! I've learned a couple of things from reading this that I can use immediately.
walnutmon