views: 982
answers: 11

Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?

I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?

+14  A: 

Even if you wanted to select every column now, you might not want to select every column after someone adds one or more new columns. If you write the query with SELECT * you are taking the risk that at some point someone might add a column of text which makes your query run more slowly even though you don't actually need that column.

Wouldn't it mean less code to change if you added a new column you wanted?

The chances are that if you actually want to use the new column then you will have to make quite a lot of other changes to your code anyway. You're only saving typing `, new_column` - just a few characters.

Mark Byers
Especially if that new column is a three-megabyte BLOB
Matti Virkkunen
@Matti - But hopefully they would put in more thought than *"Hey lets plop a huge BLOB column onto this table!"*. *(Yes a fools hope I know but can't a guy dream?)*
ChaosPandion
Performance is one aspect, but often there is also a correctness aspect: the shape of the result projected with `*` can unexpectedly *change* and this can wreak havoc in the application itself: columns referenced by ordinal (e.g. `SqlDataReader.GetString(2)`) suddenly retrieve a *different* column, any `INSERT ... SELECT *` will break and so on and so forth.
Remus Rusanu
@chaos: putting blobs on tables isn't really going to hurt your performance much... Unless you use SELECT *... ;-)
Dave Markle
@Dave - What I was trying to imply was that typically developers will just say *"Let's just add another column to the table."* instead of thinking about how it might disturb the integrity of their database.
ChaosPandion
@Chaos True, true. And it's even worse when developers say, "let's just add another index to the table..."
Dave Markle
+1  A: 

Generally you have to fit the results of your SELECT * ... into data structures of various types. Without specifying which order the results are arriving in, it can be tricky to line everything up properly (and more obscure fields are much easier to miss).

This way you can add fields to your tables (even in the middle of them) for various reasons without breaking sql access code all over the application.
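One common way to sidestep the line-up problem entirely is to access result columns by name rather than position. A minimal sketch using SQLite through Python's `sqlite3` (the `accounts` table is purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE accounts (id INTEGER, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 9.5)")

# Naming the columns in the SELECT plus named access in the code means
# neither side cares where the column physically sits in the table.
row = conn.execute("SELECT id, balance FROM accounts").fetchone()
print(row["balance"])
```

Named access keeps working even if columns are added to the table, including in the middle of it.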

jkerian
+1  A: 

Using SELECT * when you only need a couple of columns means a lot more data transferred than you need. This adds processing on the database and increases the latency of getting the data to the client. Add to this that it will use more memory when loaded - in some cases significantly more, such as with large BLOB columns. It's mostly about efficiency.

In addition to this, however, it's easier to see when looking at the query what columns are being loaded, without having to look up what's in the table.

Yes, if you do add an extra column it would be less code to change, but in most cases you'd want/need to change the code using the query to accept the new columns anyway, and there's the potential that getting columns you don't want/expect can cause issues. For example, if you grab all the columns and rely on the order in a loop to assign variables, then adding a column, or having the column order change (I've seen it happen when restoring from a backup), can throw everything off.

This is also the same sort of reasoning why if you're doing an INSERT you should always specify the columns.
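The INSERT point is easy to demonstrate. A sketch using SQLite via Python's `sqlite3` (the `users` table and its columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, 'Ada'))

# The table grows a column, as tables tend to do.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# A column-less INSERT written for the old two-column shape now fails...
try:
    conn.execute("INSERT INTO users VALUES (?, ?)", (2, 'Lin'))
except sqlite3.OperationalError as e:
    print("broken:", e)

# ...while the INSERT that names its columns is unaffected.
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (2, 'Lin'))
```

The explicit column list insulates the statement from schema growth in exactly the same way an explicit SELECT list does.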

Slokun
+2  A: 

In a lot of situations, SELECT * will cause errors at run time in your application, rather than at design time. It hides knowledge of column changes and bad column references from your application.

Andrew Lewis
So how does naming the columns help? In SQL Server, existing queries, embedded in code or SPs, won't complain until they run, even if you've named the columns. New ones will fail when you test them, but plenty of time you have to go looking for SPs affected by table changes. What sort of situations are you referring to that would be caught at design time?
ChrisA
+1  A: 

If you really want every column, I haven't seen a performance difference between SELECT * and naming the columns. The reason to name the columns might simply be to be explicit about which columns you expect to see in your code.

Often, though, you don't want every column, and SELECT * can result in unnecessary work for the database server and unnecessary information having to be passed over the network. It's unlikely to cause a noticeable problem unless the system is heavily utilised or the network connectivity is slow.

Brabster
A: 

I don't think that there can really be a blanket rule for this. In many cases, I have avoided SELECT *, but I have also worked with data frameworks where SELECT * was very beneficial.

As with all things, there are benefits and costs. I think that part of the benefit vs. cost equation is just how much control you have over the data structures. In cases where SELECT * worked well, the data structures were tightly controlled (it was retail software), so there wasn't much risk that someone was going to sneak a huge BLOB field into a table.

JMarsch
+31  A: 

There are really three major reasons:

  • Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.

  • Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.

  • Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions of SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
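The duplicate-name collision can be shown in a few lines. A sketch using SQLite via Python's `sqlite3` (the `orders`/`customers` tables are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (100, 1);
""")

cur = conn.execute("""
    SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id
""")

# Two result columns are both called 'id' -- a consumer that matches
# columns by name has no way to tell them apart.
print([d[0] for d in cur.description])
```

An explicit, aliased SELECT list (`SELECT o.id AS order_id, c.id AS customer_id, ...`) removes the ambiguity at the source.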

But it's not all bad for SELECT *. I use it liberally for these use cases:

  • Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying table names are. This gets to be a bigger "plus" the longer the column names are.

  • When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:

    SELECT COUNT(*) FROM table;
    

    in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.
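That NULL edge case is quick to verify. A sketch using SQLite via Python's `sqlite3` (table and data invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER, nickname TEXT);
    INSERT INTO people VALUES (1, 'Hopper'), (2, NULL), (3, 'Knuth');
""")

# COUNT(*) counts rows; COUNT(column) silently skips rows where
# that column is NULL.
total = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
named = conn.execute("SELECT COUNT(nickname) FROM people").fetchone()[0]
print(total, named)  # 3 2
```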

    Same goes with this type of query:

    SELECT a.ID FROM TableA a
    WHERE EXISTS (
        SELECT *
        FROM TableB b
        WHERE b.ID = a.B_ID);
    

    in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * does. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)
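You can check that the subquery's SELECT list is irrelevant here. A sketch using SQLite via Python's `sqlite3`, reusing the table names from the query above (data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER, B_ID INTEGER);
    CREATE TABLE TableB (ID INTEGER);
    INSERT INTO TableA VALUES (1, 10), (2, 99);
    INSERT INTO TableB VALUES (10);
""")

# All three conventions inside EXISTS produce identical results.
results = []
for select_list in ("SELECT *", "SELECT 1", "SELECT b.ID"):
    rows = conn.execute(f"""
        SELECT a.ID FROM TableA a
        WHERE EXISTS ({select_list} FROM TableB b WHERE b.ID = a.B_ID)
    """).fetchall()
    results.append(rows)
    print(select_list, rows)  # each prints [(1,)]
```

EXISTS only asks whether a row exists, so the engine never materializes the subquery's columns at all.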

Dave Markle
+1 for going deep...
Garis Suero
+1 Wow! I ran into the "Binding Problem" and couldn't figure out what caused it until now!
hopeseekr
Yeah that one almost caused me to throw things in the office the first time I encountered it!
Dave Markle
Using "SELECT id,name" is as likely as "SELECT *" to select two columns of the same name from two different tables when using joins. Prefixing with table name solves the problem in both cases.
Pies
+1  A: 

If you add fields to the table, they will automatically be included in all the queries where you use SELECT *. This may seem convenient, but it will actually crash your application at some point.

There is a limit on how much data you can fetch in each row of a result. If you add fields to your tables so that a result ends up being over that limit, you get an error message when you try to run the query.

This is the kind of error that is hard to find. You make a change in one place, and it blows up in some other place. It may even be a less frequently used query, so that it takes a while before someone runs into it, which makes it even harder to connect the error to its cause.

If you specify which fields you want in the result, you are safe from this kind of overflow.

Guffa
+20  A: 

The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.

Performance

The * shorthand can be slower because:

  • Not all the fields are indexed, forcing a full table scan - less efficient
  • What you save by sending SELECT * over the wire, you risk paying back with a full table scan
  • Returning more data than is needed
  • Returning trailing columns that use variable-length data types can result in search overhead

Maintenance

When using SELECT *:

  • Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable and minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
  • If code depends on column order, SELECT * will hide an error waiting to happen if a table has its column order changed.
  • Even if you need every column at the time the query is written, that might not be the case in the future
  • The usage complicates profiling

Design

SELECT * is an anti-pattern:

  • The purpose of the query is less obvious; the columns used by the application are opaque
  • It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.

When Should "SELECT *" Be Used?

It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.

Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.

OMG Ponies
-1: the idea that more columns will necessarily cause a table scan to occur just isn't correct. While it is true that indexes may be ignored in some cases, that doesn't mean a table scan is more likely to occur.
Dave Markle
@Dave Markle: Because you can predict the optimizer decisions accurately, every single time... for every single possible query that could ever run?! Because statistics *never* change... **unless your data never changes**. This reeks of tactical downvote, even more cowardly when your answer is at 11 while mine was at zero.
OMG Ponies
That's not what I'm saying. I'm trying to make the point that table scans are going to be driven more by the rows you want than the columns you want. If selecting a large number of rows, then yes, you are more likely to get a table scan. But for small numbers of rows in many query patterns, your first statement is demonstrably false in many cases. Don't take the -1 too personally, but I'm strongly of the opinion that those of us with high rep have to be correct with our facts, especially when we give longer answers, because people believe what we say.
Dave Markle
@Dave Markle: I believe in accountability, but I don't believe in being downvoted for what's not even subjective - there's no database vendor identified, so your claim goes for all databases, which isn't the case. I take it personally when no one else gets downvoted, and my content provides more than most others ON TOP OF being held accountable for an utterly flawed supposition - you believe corner cases are primary examples.
OMG Ponies
When someone asks me a question in the form of, "Why not?", corner cases are often the most important answers.
Dave Markle
@Dave Markle: There's no criteria, but rest assured--you'll impose it. Targeting high rep accounts just reinforces that you're too concerned with being overlooked for sake of reputation. But you apparently won't downvote if someone doesn't *mention* details you take issue with, only when they do - pretty flawed belief system. My synopsis is pretty accurate - you think you're being heroic to the community when you're really a spiteful coward.
OMG Ponies
If you'd like to pose a question about query optimization, post it, and I will happily do my best to answer it. Name-calling is extraordinarily unprofessional and unwarranted.
Dave Markle
@Dave Markle: Acting sanctimoniously judicial is *exactly* what you admit to. There's no need to infer from your answers, the criticism has merit. It's why most downvoters never admit to it - because they are afraid of being accountable for their decision. I took the downvote personally, because it's clear I was personally targeted.
OMG Ponies
Please read the FAQ in the section where it talks about "being nice". (and don't worry, you weren't the only answer I downvoted)
Dave Markle
@Dave Markle: So, *now* you want to suggest that you downvoted others for this question? *After* I pointed out how poor it makes your character look? Hmm... still lots of other answers, still at zero... Checked Mark Byers answer - no downvotes against it. Want to try again? You just keep digging a deeper hole...
OMG Ponies
+1 This is a very concise answer that is very well formatted and easy to read.
hopeseekr
Accepted answer b/c: 1. submitted around the same time as top-rated one, 2. contains virtually identical explanations (except about data binding problems), 3. other candidate downvoted this candidate, 4. answer is formatted much better, 5. much more concise answer.
hopeseekr
This smells of premature optimization.
Pies
"Not all the fields are indexed, forcing a full table scan - less efficient" -- This is not true if we have a where clause and have proper indexing for the columns and conditions inside the where clause.
Bytecode Ninja
+2  A: 

If you name the columns in a SELECT statement, they will be returned in the order specified, and may thus safely be referenced by numerical index. If you use "SELECT *", you may end up receiving the columns in arbitrary sequence, and thus can only safely use the columns by name. Unless you know in advance what you'll be wanting to do with any new column that gets added to the database, the most probable correct action is to ignore it. If you're going to be ignoring any new columns that get added to the database, there is no benefit whatsoever to retrieving them.
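The ordinal guarantee is easy to see side by side. A sketch using SQLite via Python's `sqlite3` (the table `t` is purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.execute("INSERT INTO t VALUES ('first', 'second')")

# Explicit list: position 0 is whatever you asked for first, always.
row = conn.execute("SELECT b, a FROM t").fetchone()
print(row[0])  # 'second'

# SELECT *: positions simply mirror the current table definition,
# so reorganizing the schema silently shifts every ordinal.
star = [d[0] for d in conn.execute("SELECT * FROM t").description]
print(star)  # ['a', 'b']
```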

supercat
+1  A: 

Think of it as reducing the coupling between the app and the database.

To summarize the 'code smell' aspect:
SELECT * creates a dynamic dependency between the app and the schema. Restricting its use is one way of making the dependency more defined; otherwise a change to the database has a greater likelihood of crashing your application.

Kelly French