All of us who work with relational databases have learned (or are learning) that SQL is different. Getting the desired results, and getting them efficiently, involves a tedious process of learning unfamiliar paradigms and discovering that some of our most familiar programming patterns don't work here. What are the common antipatterns you've seen (or committed yourself)?
Using sp_ as the prefix of a stored procedure name: SQL Server searches the system procedures in the master database first, so every call to an sp_-prefixed procedure pays for that extra lookup before your own procedure is found.
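For example (a sketch; the procedure and table names are invented):
-- Bad: the sp_ prefix sends SQL Server to master's system procedures first
CREATE PROCEDURE dbo.sp_GetCustomers
AS
    SELECT CustomerID, Name FROM dbo.Customers;
GO
-- Better: any other prefix (or none at all) avoids the extra lookup
CREATE PROCEDURE dbo.usp_GetCustomers
AS
    SELECT CustomerID, Name FROM dbo.Customers;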
Here's my top 3.
Number 1. Failure to specify a field list. (Edit: to prevent confusion: this is a production code rule. It doesn't apply to one-off analysis scripts - unless I'm the author.)
SELECT *
Insert Into blah SELECT *
should be
SELECT fieldlist
Insert Into blah (fieldlist) SELECT fieldlist
Number 2. Using a cursor and while loop, when a while loop with a loop variable will do.
DECLARE @LoopVar int
SET @LoopVar = (SELECT MIN(TheKey) FROM TheTable)
WHILE @LoopVar IS NOT NULL
BEGIN
    -- Do stuff with the current value of @LoopVar
    ...
    -- Ok, done; now get the next value
    SET @LoopVar = (SELECT MIN(TheKey) FROM TheTable
                    WHERE @LoopVar < TheKey)
END
Number 3. DateLogic through string types.
--Trim the time
Convert(datetime, Convert(varchar(10), theDate, 121))
Should be
--Trim the time
DateAdd(dd, DateDiff(dd, 0, theDate), 0)
(DateDiff(dd, 0, theDate) counts whole days from day zero, 1900-01-01; adding that many days back onto day zero lands on midnight of the same date, with no string round-trip.)
Using meaningless table aliases:
from employee t1,
department t2,
job t3,
...
Makes reading a large SQL statement so much harder than it needs to be.
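Even single letters drawn from the table names fix it:
from employee e,
department d,
job j,
...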
1) I don't know whether it's an "official" anti-pattern, but I dislike and try to avoid string literals as magic values in a database column.
An example from MediaWiki's table 'image':
img_media_type ENUM("UNKNOWN", "BITMAP", "DRAWING", "AUDIO", "VIDEO",
"MULTIMEDIA", "OFFICE", "TEXT", "EXECUTABLE", "ARCHIVE") default NULL,
img_major_mime ENUM("unknown", "application", "audio", "image", "text",
"video", "message", "model", "multipart") NOT NULL default "unknown",
(I just noticed the different casing, another thing to avoid.)
I design such cases as int lookups into tables ImageMediaType and ImageMajorMime with int primary keys.
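A sketch of that design (names are mine, not MediaWiki's):
CREATE TABLE ImageMediaType (
    ImageMediaTypeID int NOT NULL PRIMARY KEY,
    Name varchar(20) NOT NULL UNIQUE   -- 'BITMAP', 'DRAWING', ...
);
-- image then stores img_media_type int NULL REFERENCES ImageMediaType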
2) date/string conversion that relies on specific NLS settings
CONVERT(NVARCHAR, GETDATE())
without a format identifier, so the result depends on the session's language and date settings.
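Pinning the style makes it deterministic; style 126, for instance, always yields ISO 8601:
CONVERT(NVARCHAR(30), GETDATE(), 126)   -- e.g. 2009-01-15T13:45:30.123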
Human-readable password fields, egad. Self-explanatory.
Using LIKE against indexed columns (a leading wildcard such as '%foo' makes the index useless), and I'm almost tempted to just say LIKE in general.
Recycling SQL-generated PK values.
Surprised nobody has mentioned the god-table yet. Nothing says "organic" like 100 columns of bit flags, large strings and integers.
Then there's the "I miss .ini files" pattern: storing CSVs, pipe-delimited strings or other data that needs parsing in large text fields.
And for MS SQL server the use of cursors at all. There's a better way to do any given cursor task.
Edited because there's so many!
My bugbears are the 450-column Access tables that have been put together by the 8-year-old son of the Managing Director's best friend's dog groomer, and the dodgy lookup table that only exists because somebody doesn't know how to normalise a data structure properly.
Typically, this lookup table looks like this:
ID INT,
Name NVARCHAR(132),
IntValue1 INT,
IntValue2 INT,
CharValue1 NVARCHAR(255),
CharValue2 NVARCHAR(255),
Date1 DATETIME,
Date2 DATETIME
I've lost count of the number of clients I've seen who have systems that rely on abominations like this.
I am consistently disappointed by most programmers' tendency to mix their UI logic into the data access layer:
SELECT
    FirstName + ' ' + LastName as "Full Name",
    case UserRole
        when 2 then 'Admin'
        when 1 then 'Moderator'
        else 'User'
    end as "User's Role",
    case SignedIn
        when 0 then 'Logged out'
        else 'Logged in'
    end as "User signed in?",
    Convert(varchar(100), LastSignOn, 101) as "Last Sign On",
    DateDiff(d, LastSignOn, getDate()) as "Days since last sign on",
    AddrLine1 + ' ' + AddrLine2 + ' ' + AddrLine3 + ' ' +
        City + ', ' + State + ' ' + Zip as "Address",
    'XXX-XX-' + Substring(Convert(varchar(9), SSN), 6, 4) as "Social Security #"
FROM Users
Normally, programmers do this because they intend to bind the dataset directly to a grid, and it's just convenient to have SQL Server format server-side rather than format on the client.
Queries like the one shown above are extremely brittle because they tightly couple the data layer to the UI layer. On top of that, this style of programming thoroughly prevents stored procedures from being reusable.
select some_column, ...
from some_table
group by some_column
and assuming that the result will be sorted by some_column. I've seen this a bit with Sybase, where the assumption holds (for now).
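If the ordering matters, say so explicitly; it costs nothing when the rows already come back in that order:
select some_column, ...
from some_table
group by some_column
order by some_column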
Using SQL as a glorified ISAM (Indexed Sequential Access Method) package. In particular, nesting cursors instead of combining SQL statements into a single, albeit larger, statement. This also counts as 'abuse of the optimizer' since in fact there isn't much the optimizer can do. This can be combined with non-prepared statements for maximum inefficiency:
DECLARE c1 CURSOR FOR SELECT Col1, Col2, Col3 FROM Table1
FOREACH c1 INTO a.col1, a.col2, a.col3
DECLARE c2 CURSOR FOR
SELECT Item1, Item2, Item3
FROM Table2
WHERE Table2.Item1 = a.col2
FOREACH c2 INTO b.item1, b.item2, b.item3
...process data from records a and b...
END FOREACH
END FOREACH
The correct solution (almost always) is to combine the two SELECT statements into one:
DECLARE c1 CURSOR FOR
SELECT Col1, Col2, Col3, Item1, Item2, Item3
FROM Table1, Table2
WHERE Table2.Item1 = Table1.Col2
-- ORDER BY Table1.Col1, Table2.Item1
FOREACH c1 INTO a.col1, a.col2, a.col3, b.item1, b.item2, b.item3
...process data from records a and b...
END FOREACH
The only advantage to the double loop version is that you can easily spot the breaks between values in Table1 because the inner loop ends. This can be a factor in control-break reports.
Also, sorting in the application is usually a no-no.
The ones that I dislike the most are
Using spaces when creating tables, sprocs etc. I'm fine with CamelCase or under_scores and singular or plurals and UPPERCASE or lowercase but having to refer to a table or column [with spaces], especially if [ it is oddly spaced] (yes, I've run into this) really irritates me.
Denormalized data. A table doesn't have to be perfectly normalized, but when I run into a table of employees that has information about their current evaluation score or their primary anything, it tells me that I will probably need to make a separate table at some point and then try to keep them synced. I will normalize the data first and then if I see a place where denormalization helps, I'll consider it.
Overuse of either views or cursors. Views have a purpose, but when each table is wrapped in a view it's too much. I've had to use cursors a few times, but generally you can use other mechanisms for this.
Access. Can a program be an anti-pattern? We have SQL Server at my work, but a number of people use Access due to its availability, "ease of use" and "friendliness" to non-technical users. There is too much here to go into, but if you've been in a similar environment, you know.
The
FROM TableA, TableB WHERE
syntax for joins, rather than
FROM TableA INNER JOIN TableB ON
Making assumptions that a query will be returned sorted a certain way without putting an ORDER BY clause in, just because that was the way it showed up during testing in the query tool.
var query = "select COUNT(*) from Users where UserName = '" + tbUser.Text + "' and Password = '" + tbPassword.Text +"'";
1) Not sanitizing user input (!!!!!)
2) Query via concatenation, aka not using parameterized queries
3) Cleartext passwords
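For contrast, a minimal parameterized sketch using sp_executesql (the variable names are stand-ins for values passed in properly from the application, and the password is assumed to be stored as a hash):
DECLARE @UserNameInput nvarchar(128) = N'alice';   -- would come from the UI
DECLARE @PasswordHashInput varbinary(64) = 0x00;   -- hash of the entered password
EXEC sp_executesql
    N'SELECT COUNT(*) FROM Users WHERE UserName = @u AND PasswordHash = @p',
    N'@u nvarchar(128), @p varbinary(64)',
    @u = @UserNameInput,       -- bound, never concatenated
    @p = @PasswordHashInput;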
I need to put my own current favorite here, just to make the list complete. My favorite antipattern is not testing your queries.
This applies when:
- Your query involves more than one table.
- You think you have an optimal design for a query, but don't bother to test your assumptions.
- You accept the first query that works, with no clue about whether it's even close to optimized.
And any tests run against atypical or insufficient data don't count. If it's a stored procedure, put the test statement into a comment and save it, with the results. Otherwise, put it into a comment in the code with the results.
The two I find most often, and which can have a significant performance cost, are:
Using cursors instead of a set-based expression. I guess this one occurs frequently when the programmer is thinking procedurally.
Using correlated sub-queries, when a join to a derived table can do the job.
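A sketch of that rewrite (the table and column names are illustrative):
-- Correlated sub-query: re-evaluated for every outer row
SELECT e.empno, e.salary
FROM emp e
WHERE e.salary > (SELECT AVG(x.salary) FROM emp x WHERE x.deptno = e.deptno);
-- Join to a derived table: the aggregate is computed once per department
SELECT e.empno, e.salary
FROM emp e
JOIN (SELECT deptno, AVG(salary) AS avg_sal
      FROM emp GROUP BY deptno) d ON d.deptno = e.deptno
WHERE e.salary > d.avg_sal;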
The Altered View - a view that is altered too often, without notice or reason. The change will either be noticed at the most inappropriate time or, worse, be wrong and never noticed. Maybe your application will break because someone thought of a better name for that column. As a rule, views should extend the usefulness of base tables while maintaining a contract with consumers. Fix problems, but don't add features or, worse, change behavior; for that, create a new view. To mitigate, do not share views with other projects, and use CTEs where the platform allows. If your shop has a DBA you probably can't change views, but in that case all your views will be outdated and/or useless.
The !Paramed - can a query have more than one purpose? Probably, but the next person who reads it won't know without deep meditation. Even if you don't need parameters right now, chances are you will, even if it's "just" to debug. Adding parameters lowers maintenance time and keeps things DRY. If you have a where clause, you should have parameters.
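A minimal sketch of the idea (procedure and table names invented):
CREATE PROCEDURE dbo.GetOrdersByCustomer
    @CustomerID int
AS
    SELECT OrderID, OrderDate, Total
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID;   -- the filter is visible, reusable and debuggable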
The case for no CASE -
SELECT
CASE @problem
WHEN 'Need to replace column A with this medium to large collection of strings hanging out in my code.'
THEN 'Create a table for lookup and add to your from clause.'
WHEN 'Scrubbing values in the result set based on some business rules.'
THEN 'Fix the data in the database'
WHEN 'Formatting dates or numbers.'
THEN 'Apply formatting in the presentation layer.'
WHEN 'Creating a cross tab'
THEN 'Good, but in reporting you should probably be using cross tab, matrix or pivot templates'
ELSE 'You probably found another case for no CASE but now I have to edit my code instead of enriching the data...' END
Contrarian view: over-obsession with normalization.
Most SQL/RDBMS systems give you lots of features (transactions, replication) that are quite useful, even with unnormalized data. Disk space is cheap, and sometimes it can be simpler (easier code, faster development time) to manipulate/filter/search fetched data than it is to write up a 1NF schema and deal with all the hassles therein (complex joins, nasty subselects, etc.).
I have found that over-normalized systems are often a premature optimization, especially during early development stages.
(more thoughts on it... http://writeonly.wordpress.com/2008/12/05/simple-object-db-using-json-and-python-sqlite/)
Maybe not an anti-pattern, but it annoys me when DBAs of certain DBs (OK, I'm talking about Oracle here) write SQL Server code using Oracle style and code conventions and complain when it runs so badly. Enough with the cursors, Oracle people! SQL is meant to be set-based.
Putting stuff in temporary tables: people who switch from SQL Server to Oracle especially have a habit of overusing temporary tables. Just use nested select statements.
Just to support Point 1 of David's answer, check out this example of a situation where using select * may produce some unexpected results.
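One classic failure mode (a sketch of my own, not the linked example; source is an assumed table):
CREATE TABLE blah (id INT, name NVARCHAR(50));
INSERT INTO blah SELECT * FROM source;   -- works today
-- Later someone adds or reorders a column in source: the INSERT now fails
-- with a column-count error, or silently loads values into the wrong columns.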
Joining redundant tables into a query like this:
select emp.empno, dept.deptno
from emp
join dept on dept.deptno = emp.deptno;
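Since deptno is already a column of emp, the join adds nothing:
select emp.empno, emp.deptno
from emp;
-- (note the original inner join would also silently drop employees with a NULL
-- or unmatched deptno, so the two differ in that edge case)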
I just put this one together, based on some of the SQL responses here on SO.
It is a serious antipattern to think that triggers are to databases as event handlers are to OOP. There's this perception that just any old logic can be put into triggers, to be fired off when a transaction (event) happens on a table.
Not true. One of the big differences is that triggers are synchronous - with a vengeance, because they are synchronous on a set operation, not on a row operation. On the OOP side it's exactly the opposite - events are an efficient way to implement asynchronous transactions.
using @@IDENTITY instead of SCOPE_IDENTITY()
Quoted from this answer:
- @@IDENTITY returns the last identity value generated for any table in the current session, across all scopes. You need to be careful here, since it's across scopes. You could get a value from a trigger, instead of your current statement.
- SCOPE_IDENTITY returns the last identity value generated for any table in the current session and the current scope. Generally what you want to use.
- IDENT_CURRENT returns the last identity value generated for a specific table in any session and any scope. This lets you specify which table you want the value from, in case the two above aren't quite what you need (very rare). You could use this if you want to get the current IDENTITY value for a table that you have not inserted a record into.
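A quick illustration (Orders and its identity column are assumed names):
INSERT INTO Orders (CustomerID) VALUES (42);
SELECT SCOPE_IDENTITY();   -- the identity value your INSERT generated
SELECT @@IDENTITY;         -- may instead return a value generated by a trigger on Orders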
SELECT FirstName + ' ' + LastName as "Full Name", case UserRole when 2 then 'Admin' when 1 then 'Moderator' else 'User' end as "User's Role", case SignedIn when 0 then 'Logged out' else 'Logged in' end as "User signed in?", Convert(varchar(100), LastSignOn, 101) as "Last Sign On", DateDiff(d, LastSignOn, getDate()) as "Days since last sign on", AddrLine1 + ' ' + AddrLine2 + ' ' + AddrLine3 + ' ' + City + ', ' + State + ' ' + Zip as "Address", 'XXX-XX-' + Substring(Convert(varchar(9), SSN), 6, 4) as "Social Security #" FROM Users
Or, cramming everything into one line.
Temporary Table abuse.
Specifically this sort of thing:
SELECT personid, firstname, lastname, age
INTO #tmpPeople
FROM People
WHERE lastname like 's%'
DELETE FROM #tmpPeople
WHERE firstname = 'John'
DELETE FROM #tmpPeople
WHERE firstname = 'Jon'
DELETE FROM #tmpPeople
WHERE age > 35
UPDATE People
SET firstname = 'Fred'
WHERE personid IN (SELECT personid from #tmpPeople)
Don't build a temporary table from a query, only to delete the rows you don't need.
And yes, I have seen pages of code in this form in production DBs.
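The whole thing collapses to one statement (a sketch; the NULL guards are needed to match the temp-table version exactly):
UPDATE People
SET firstname = 'Fred'
WHERE lastname LIKE 's%'
  AND (firstname IS NULL OR firstname NOT IN ('John', 'Jon'))
  AND (age IS NULL OR age <= 35)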
For storing time values, only the UTC timezone should be used; local time should not be stored. In T-SQL that means reaching for GETUTCDATE() or SYSUTCDATETIME() rather than GETDATE().
Having 1 table
code_1
value_1
code_2
value_2
...
code_10
value_10
Instead of having 3 tables
code, value and code_value
You never know when you may need more than 10 code/value pairs.
And you don't waste disk space if you only need one pair.
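One normalized shape is a single child table (a sketch with invented names; the three-table split above just factors code and value out further):
CREATE TABLE code_value (
    parent_id INT NOT NULL,   -- FK back to the original table
    code INT NOT NULL,
    value NVARCHAR(255) NOT NULL,
    PRIMARY KEY (parent_id, code)
);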
Learning SQL in the first six months of their career and never learning anything else for the next 10 years. In particular, never learning or effectively using windowing/analytical SQL features such as OVER() and PARTITION BY.
Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group.
See O'Reilly SQL Cookbook Appendix A for a nice overview of windowing functions.
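A small example of OVER() with PARTITION BY (emp and its columns are illustrative):
SELECT empno,
       deptno,
       salary,
       AVG(salary) OVER (PARTITION BY deptno) AS dept_avg_salary   -- one value per row, not per group
FROM emp;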
Not using the WITH clause or a proper join, and relying on sub-queries instead.
Anti-Pattern:
select
...
from data
where data.state IN (
SELECT STATEID
FROM STATE
WHERE NAME IN
('Published to test',
'Approved for public',
'Published to public',
'Archived'
))
Better:
I like using the with clause to make my intent more readable.
with valid_states as (
    SELECT STATEID as state
    FROM STATE
    WHERE NAME IN
        ('Published to test',
         'Approved for public',
         'Published to public',
         'Archived')
)
select ... from data, valid_states
where data.state = valid_states.state
Best:
select
...
from data join state on state.stateid = data.state
where
state.name in ('Published to test',
               'Approved for public',
               'Published to public',
               'Archived')
Re-using a 'dead' field for something it wasn't intended for (e.g. storing user data in a 'Fax' field) - very tempting as a quick fix though!
re: using @@IDENTITY instead of SCOPE_IDENTITY()
you should use neither; use the OUTPUT clause instead
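A sketch of the OUTPUT approach (the Orders table is assumed, as above):
DECLARE @NewIds TABLE (OrderID int);
INSERT INTO Orders (CustomerID)
OUTPUT inserted.OrderID INTO @NewIds
VALUES (42);
-- @NewIds now holds every generated key, and it works for multi-row inserts too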
Using primary keys as a surrogate for record addresses and using foreign keys as a surrogate for pointers embedded in records.