To preface, I'm aware (as you should be!) that using SELECT * in production is bad, but I was maintaining a script written by someone else. I'm also aware that this question is low on specifics... so treat it as a hypothetical scenario.

Let's say I have a script that selects everything from a table of 20 fields. Let's say typical customer information.

Then let's say being the good developer I am, I shorten the SELECT * to a SELECT of the 13 specific fields I'm actually using on the display end.

What type of performance benefit, if any, could I expect by explicitly listing the fields versus SELECT *?

I will say this: both queries take advantage of exactly the same indexes. In case you were wondering, the more specific query does not get to use a covering index that the other query could not.
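(For anyone unfamiliar with the term: a covering index already contains every column a query selects, so the database can answer from the index alone without touching the table. A rough sketch in SQLite, with invented table and index names, showing the planner calling this out:)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY,"
                 " name TEXT, email TEXT, notes TEXT)")
    conn.execute("CREATE INDEX idx_name_email ON customers (name, email)")

    # The index alone answers this query; the plan should say
    # "USING COVERING INDEX idx_name_email".
    for row in conn.execute("EXPLAIN QUERY PLAN SELECT name, email"
                            " FROM customers WHERE name = 'a'"):
        print(row)

    # SELECT * needs columns the index lacks, so the table must be
    # visited too; the plan should say only "USING INDEX".
    for row in conn.execute("EXPLAIN QUERY PLAN SELECT *"
                            " FROM customers WHERE name = 'a'"):
        print(row)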

I'm not expecting miracles, like adding an index that targets the more specific query. I'm just wondering.

A: 

Why don't you try it yourself and let us know?

It's all going to depend on how many columns there are and how wide they are.

Better still, do you have an actual performance problem? Tell us what your actual problem is and show us the code, and then we can suggest potential improvements. Chances are there are other improvements to be made that are much better than worrying about SELECT * vs. SELECT field list.

Andy Lester
+1  A: 

The main difference I would expect to see is reduced network traffic. If any of the columns are large, they could take time to transfer, which is of course a complete waste if you're not displaying them.

It's also fairly critical if your database library references columns by index (instead of name), because if the column order changes in the database, it'll break the code.
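For example, with Python's sqlite3 module (purely an illustration; the table and data here are made up), the default row object is a plain tuple indexed by position:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")

    row = conn.execute("SELECT * FROM customers").fetchone()
    email = row[2]  # silently wrong if a column is ever added before email

    row = conn.execute("SELECT id, name, email FROM customers").fetchone()
    email = row[2]  # safe: the query itself pins position 2 to email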

Coding-style-wise, it lets you see which columns the rest of the code will be using, without having to read through it.

Eric
+7  A: 

It depends on three things: the underlying storage and retrieval mechanism used by your database, the nature of the 7 columns you're leaving out, and the number of rows returned in the result set.

If the 7 (or whatever number) columns you're leaving out are "cheap to retrieve" columns, and the number of rows returned is low, I would expect very little benefit. If the columns are "expensive" (for instance, they're large, or they're BLOBs requiring reference to another file that is never cached) and/or you're retrieving a lot of rows, then you could expect a significant improvement. Just how much depends on how expensive it is in your particular database to retrieve that information and assemble it in memory.

Incidentally, there are reasons besides speed to use named columns when retrieving information: you know for certain that particular columns are contained in the result set, and that they arrive in the order in which you want to use them.

Larry Lustig
+1  A: 

Hmm, in one simple experiment, I was surprised at how much difference it made.

I just did a simple query with three variations:

  1. select *
  2. select the field that is the primary key. (The database might pull this directly from the index without actually reading the record.)
  3. select a non-key field.

I used a table with a pretty large number of fields -- 72 of them -- including one CLOB. The query was just a select with one condition in the where clause.
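(A rough sketch of this kind of timing harness, using Python's sqlite3 against a made-up, much smaller table; all names are invented, and the big TEXT column stands in for the CLOB:)

    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    # Stand-in for a wide table: a few scalar columns plus one large
    # TEXT column playing the part of the CLOB.
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, flag INTEGER,"
                 " note TEXT, big TEXT)")
    conn.executemany("INSERT INTO t (flag, note, big) VALUES (?, ?, ?)",
                     [(i % 10, "row %d" % i, "x" * 2000)
                      for i in range(10000)])

    def timed(sql):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        return time.perf_counter() - start

    for sql in ("SELECT * FROM t WHERE flag = 3",      # all columns
                "SELECT id FROM t WHERE flag = 3",     # key only
                "SELECT note FROM t WHERE flag = 3"):  # one non-key column
        print("%.3fs  %s" % (timed(sql), sql))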

Results:

Run  *     Key   Non-key
1   .647  .020  .028
2   .599  .041  .014
3   .321  .019  .027
avg .522  .027  .023

Key vs non-key didn't seem to matter. (Which surprises me.) But retrieving just one field versus select * saved 95% of the runtime!

Of course this is one tiny experiment with one table. There could be many many relevant factors. I'm certainly not claiming that you will always reduce runtime by 95% by not using select *! But it's far more impressive than I expected.

Jay
Did you try running the Non-Key or Key query first, to ensure that there was no caching benefit to them compared with the '*'? It is surprising that the difference was so great.
Jonathan Leffler
Considering the non-key column came back in run 2 much faster than the key column (assuming that's not a typo) I think caching is implicated in some of the difference in speed.
Larry Lustig
@Jonathan: I ran them multiple times in different orders, to avoid getting misleading results because of caching. I ran each one twice in a row to see if I'd get different results the second time, etc. Besides caching, the other obvious issue is what else happened to be hitting the database at the same time. I certainly don't claim that my results here are definitive -- just interesting.
Jay
@Larry: As the difference between key and non-key is small, I think that's most likely just the sort of random sampling error you always get, and that the two are really about the same. Note that the second number on key is what pushes its average up. I looked at the explain plan and it said it was doing a full-file sequential read for all of them -- my selection criterion was a field that I think is non-indexed; I'd have to check. Maybe a search on a key field in a certain range would have given different results.
Jay
It occurs to me that the inclusion of the CLOB might be the major source of difference. That probably forces it to go out to another file or region. I'll have to repeat the experiment with a file that doesn't have a CLOB.
Jay
+1  A: 

When comparing 13 vs. 20 fields, if the 7 fields that are left out are not CLOB/BLOB-type fields, I would expect to see no noticeable performance gain.

I/O is the main DB bottleneck (most DB systems are I/O bound), so you might think you could bring execution time down to 13/20 of the original query's (since you need that much less data). But normal fields are stored within the same physical structure (fields are usually arranged consecutively) and the file system reads whole blocks, so your disk heads will read the same amount of data either way (assuming all 20 fields together fit within the block size; the situation changes if a record is bigger than a filesystem block).
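To put rough numbers on it: 20 columns averaging 40 bytes each is about an 800-byte row, so a typical 8 KB page holds roughly ten rows, and the same ten rows come off the disk whether the query asks for 13 of the columns or all 20.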

The principle that SELECT * is bad has a different cause: stability of the system.

If you use SELECT * in the wrong places, then changes to the underlying table(s) might break your system unexpectedly (mostly later; and when things break, it is usually better if they break sooner). This can get especially interesting if you normalize data (move columns from one table to another while keeping the same name). In such cases, if you chain SELECT * in views, and then chain your views, you might not get any errors, but you can end up with (essentially) different results.
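A minimal sketch of that "no errors, different results" failure mode (SQLite here; table and column names are invented). SELECT * matches columns by position, not by name, so a column-order change misaligns the data without a single error:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # archive was created before someone reordered live's columns;
    # the names still match, the positions no longer do
    conn.execute("CREATE TABLE live (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE archive (name TEXT, id INTEGER)")
    conn.execute("INSERT INTO live VALUES (1, 'Ada')")

    conn.execute("INSERT INTO archive SELECT * FROM live")  # no error raised
    print(conn.execute("SELECT name, id FROM archive").fetchone())
    # prints (1, 'Ada'): the id landed in name and the name landed in id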

Unreason
A: 

SELECT * means the database has to take time to look up the fields. If you don't need all those fields (and any time you have an inner join you don't, as the join field is repeated!), then you are wasting both server resources to get the data and network resources to transport it. You may also be wasting memory holding the recordset while you work with it.

While the performance improvement may be tiny for one query, how many times is that query run? And people who use this abysmally poor technique tend to use it everywhere, so fixing all of them can be a major improvement for not that much effort. And how hard is it to specify the fields? I don't know about every database, but in SQL Server I can drag and drop what I want from the object browser in seconds. Using SELECT * trades less than a minute of development time for worse performance every single time the query is run, and it creates code that is fragile and subject to very bad problems as the schema changes. I see no reason to ever use SELECT * in production code.
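(On the inner-join point, a quick sketch -- Python's sqlite3, made-up tables -- showing SELECT * hauling the join key back twice:)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
    conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
    conn.execute("INSERT INTO orders VALUES (100, 1)")

    cur = conn.execute("SELECT * FROM orders JOIN customers"
                       " ON orders.customer_id = customers.customer_id")
    print([d[0] for d in cur.description])
    # ['order_id', 'customer_id', 'customer_id', 'name'] -- key shipped twice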

HLGEM