views:

257

answers:

7

if I just need 2/3 columns and I query SELECT * instead of providing those columns in select query. is there any performance degradation regarding more/less I/O or memory ??

the network overhead might be present if I do select * without a need.

But in a select operation does the database engine always pulls atomic tuple from the disk ? or it pulls only those columns requested in the select operation.

If it always pulls a tuple then I/O overhead is the same.

at the same time. there might be a memory consumption for stripping out the requested columns for a tuple. if it pulls a tuple.

So if thats the case select someColumn will have more memory overhead than that of select

+12  A: 

You should always only select the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects - like accessing your result columns on client side by index, then having those indexes become incorrect by adding a new column to the table.

[edit]: Meant accessing. Stupid brain still waking up.

Donnie
+1 for an edge case that I believe not many will think of at first glance - indexes on the client side and added/changed columns.
Tomas Lycken
Yea, but is use of numeric indices for columns that common? I've always accessed column data using string keys or property names if using ORM.
Lèse majesté
saw this a long time ago, junior programmer selected * from a table and made assumptions about column order; all his code broke as soon as someone else changed the table. What fun we had.
Paul McKenzie
It's probably a bad idea to use column order in general just for code readability's sake, doubly bad to use `SELECT *` with it.
Lèse majesté
Wow, accessing columns by index in client code seems like a *phenomenally* bad idea. For that matter, relying on the order in which columns appear in a result set *in any way* feels very dirty to me.
Matt Peterson
+4  A: 

This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few Mbs in size.

Needless to say I didn't SELECT that column unless I really needed it. Having that data floating around - especially when I selected mulitple rows - was just a hassle.

However, I will admit that I otherwise usually query for all the columns in a table.

LeguRi
LOB columns are always my favourite example of the perils of SELECT *. So I was about to upvote you, until I read the third paragraph. Tsk, tsk. What happens if some other developer adds a BLOB to a table which currently doesn't have such a column?
APC
@APC - This is true - but it's a confession; I know I should know better :'( Can you forgive me if I don't use `*` - I actually list the columns?
LeguRi
@APC, I wish I could upvote your comment more. Think of your poor coworker who just wants to add a column without causing a huge performance meltdown! Think of how angry they'll be when they discover after a few hours your innocent looking select *.
Mike Sherov
Ya If the table contains a LOB field select * should always be avoided. But I am talking about the generic theoretic aspect.
@user256007, yes, even without BLOB... BLOB just illustrates the extreme example. Check my response to Charles, there are times when selecting specific columns can enable you to grab the data from memory without even going to disk!
Mike Sherov
LeguRi
@Richard, I think they are great for when optimizing DB performance is not your main concern, which is 99% of the time. As with most frameworks, they tend to generalize things to enable faster development while sacrificing pure performance. As Knuth said: "Premature optimization is the root of all evil." When you get to the point where you need to worry about the performance of select columns vs. select *, (ask Twitter about RoR) you can worry about it and optimize it then. If the framework isn't robust enough to support that, then I'd say you're using the wrong framework.
Mike Sherov
@user256007 - the general rule is "don't use SELECT *'. The answer from marc_s has all the reasosn why this is the case.
APC
+2  A: 

It always pulls a tuple (except in cases where the table has been vertically segmented - broken up into columns pieces), so, to answer the question you asked, it doesn't matter from a perfomance perspective. However, for many other reasons, you should always select specifically those columns you want, by name.

It always pulls a tuple, because (in every vendors RDBMS I am familiar with, the underlying on-disk storage structure for everything (including table data) is based on defined I/O Pages (in SQL Server for e.g., each Page is 8 kilobytes. And every I/O read or write is by Page.. I.e., every write or read is a complete Page of data.

Because of this underlying structural constraint, a consequence is that Each row of data in a database must always be on one and only one page. It cannot span multiple Pages of data (except for special things like blobs, where the actual blob data is stored in separate Page-chunks, and the actual table row column then only gets a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases ( for special types of data, or certain optimizations for special circumstances)
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual data for the Blob, or whatever), it must be stored on a single IO Page...

EXCEPTON. The only place where Select * is ok, is in the subquery after an Exists or Not Exists predicate clause, as in:

   Select colA, colB
   From table1 t1
   Where Exists (Select * From Table2
                 Where column = t1.colA)
Charles Bretana
"It always pulls a tuple" are you sure ? Hmm Okay So I was right. if thats the case `select *` will have less memory overhead than `select column` but same I/O overhead. so If we leave network overhead. `select *` if less overhead than that of `select column`
This is NOT true. One example off the top of my head is when you want only the value of an indexed column in MySQL (for example, just to check for row existence), and you're using MyISAM storage engine, it'll grab the data from the MYI file, which could be in memory, and not even go to disk!
Mike Sherov
Ya if the requested set of tuple is in memory there will be no I/O but thats special case.So What is the summery. If I select some indexed Column then entire tuple is not read ? otherwise entire tuple is read ?
I'm not exactly sure how MySql does caching, but in SQL Server, and In Oracle, even when data is in in-memory cache, it still accesses it using the same Page structre as it would when accessing it from disk. meaning that it would require one memory I/O per page of data... exactly the same as it would from disk. (except memory I/Os are much faster than Disk I/Os of course). Indeed, that's a goal of caching design, to make the access process totally independant on location of the data.
Charles Bretana
@Charles Bretana:So If I invoke `select ColumnName` there will be a memory overhead of stripping out not requested Cells out of the tuple and transmitting only the requested Columns. agreed ?
@user256007, it's not that much of a special case. I'm not sure of all the cases in which it doesn't read the entire tuple, I know enough exist to always specify. You have plenty of examples already :)
Mike Sherov
@user, Not necessarily, this is so insignificant as to not be a concern to DB engine design. In fact, it's more likely that all queries have to "process" each column they deliver from the Page, (Even when using `Select *`, so then, the fewer columns requested, the LESS the processing... But again, this is insginificant in-memory cpu processing load, so much more important are the potential maintenance and logic errors from using `Select *` DO NOT Use it for performance reasons..
Charles Bretana
+12  A: 

There are several reasons you should never (never ever) use SELECT * in production code:

  • since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time

  • if you need only 2/3 of the columns, you're selecting 1/3 too much data which needs to be retrieving from disk and sent across the network

  • if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)

  • in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.

Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there) - but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.

marc_s
I am only bothered about memory and I/O overhead. I've already mentioned that `select *` will have more network overhead. according to your second point. you meant select operation don't pull atomic tuples. rather it pulls only the requested columns from the disks.so there will be a memory overhead in `select column` to check which cell's data to pull.as far I know Data is always stored on disk as tuples. not sure how select pulls it. so `select *` will not require a through check through Data Structure of the Table
@marc, whilst agreeing with you in practice, you are certainly correct in all cases when fetching column data from the table, as this question addresses), yr emphasis on EVER nevertheless drives me to point out that this rules is not general to ALL Sql queries... specifically, it's use in a subquery after an EXISTS predicate, (as in `Where Exists (Select * From ...`) the use of `Select *` is certainly no issue, and in some circles is considered a best practice.
Charles Bretana
@Charles Bretana: yes, the `IF EXISTS(SELECT *...` is a special case - since there, no data is really retrieved, but it's just a check for existance, the SELECT * is not an issue there...
marc_s
Typically if we need to consistently access specific parts of a table, we will create a view containing only the columns we need.Of course, we then do SELECT * from my_view. From a performance POV, is this just as bad as selecting all from the table?
baultista
A: 

During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * for SELECT a, b, c... Why? Becuase that's where the information on the structure and layout of the table on the system is.

It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.

Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.

Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.

The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.

DBs rely heavily on page caching for there work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.

However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.

If you have:

CREATE TABLE customer (
    id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(150) NOT NULL,
    city VARCHAR(30),
    state VARCHAR(30),
    zip VARCHAR(10));

CREATE INDEX k1_customer ON customer(id, name);

Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that you DB will pull this data from the index, rather than from the tables.

Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.

Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.

This is a hand wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.

In the end, SELECT * is useful for dynamic queries you have to type by hand, I'd never use it for "real code". Identification of individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.

Will Hartung
A: 

Hi, I think there is no exact answer for your question, because you have pondering performance and facility of maintain your apps. Select column is more performatic of select *, but if you is developing an oriented object system, then you will like use object.properties and you can need a properties in any part of apps, then you will need write more methods to get properties in special situations if you don't use select * and populate all properties. Your apps need have a good performance using select * and in some case you will need use select column to improve performance. Then you will have the better of two worlds, facility to write and maintain apps and performance when you need performance.

Marcelo
A: 

Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.

On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.

gxti