views:

817

answers:

14

Is it true that ORDER BY is generally pretty slow? I am trying to run some sql statements where the WHERE clause is pretty simple, but then I am trying an 'ORDER BY' on a VARCHAR(50) indexed column.

I need to sort alphabetically for display reasons. I figured that getting the database to do it for me is the most efficient.

At this point, I am looking to either

  • optimize the sql query
  • sort the result set in code

Here is the actual query I am trying to run:

// B.SYNTAX is a TEXT/CLOB field
// Indexes on NAME, MODULENAME. PREVIOUS is a CHAR(1) with no index
"SELECT A.NAME, B.SYNTAX, B.DESCRIPTION, A.RATE1, A.RATE2, 
A.RATE3, A.STARTDATE, A.ENDDATE, A.HIDE, A.CATEGORYNAME 
FROM A, B WHERE A.MODULENAME='"+loadedModuleName+"' 
AND A.NAME = B.NAME AND (A.PREVIOUS<>'N' OR A.PREVIOUS IS NULL) 
ORDER BY A.NAME"

The size of table A is ~2000 rows and B is about ~500.

I should probably also mention that I cannot do much database specific optimization since we support multiple databases. Also, the application is deployed at a customer site.

I am expecting hundreds of records to be returned (less than 1000).

What would you do? Any tips are appreciated. Thanks.

+1  A: 

It shouldn't be slow. Optimize your query and database structure (at least indexes and statistcs if it's SQL Server). Maybe there is some other thing in your query other than ORDER BY which causes this slowness?

SELECT A.NAME, B.SYNTAX, B.DESCRIPTION, A.RATE1, A.RATE2, A.RATE3,
       A.STARTDATE, A.ENDDATE, A.HIDE, A.CATEGORYNAME
FROM Table1 A JOIN Table2 B on A.Name = B.Name
WHERE A.MODULENAME = @ModuleName AND A.PREVIOUS<>'N' OR A.PREVIOUS IS NULL
ORDER BY A.NAME

Option 1

If you're quering just a few simple columns (2-4), you can include them into the index as well. This way your query will be ran faster. Also make sure that sorting order on that index column matches sorting order in your query.

// if your query looks like this:
SELECT [Name], [Title], [Count] ORDER BY [COUNT]

// you can create an index on [Name], [Title], [Count]

Option 3

Create a view and bind it to the schema. Then query data from that view.

Option 3

If you use SQL Server 2005 and obove, you can also try to run you query in SQL Server Profiler and it will recommend to you the best index and statistics which you can apply to your table in order to optimize this particular query's performance.

Option 4

Try to rebuild your indexes and statistics.

Option 5

You can try putting you index/table into separate filegroup on different hard drive.

Koistya Navin
It's better to create an index on [Name], [Title], [Count], so it can be used both for filtering and for ordering.
Quassnoi
Quassnoi, sounds reasonable.
Koistya Navin
+3  A: 

ORDER BY is not generally slow, provided that the database can find an index that corresponds with the ORDER BY expression.

However, your SQL statement might include other things that force the database to scan the entire table before returning the results, like SELECT TOP n

Kluge
By "scan the entire database" did you mean "scan the entire *table*" ?
JosephStyons
Lol - I'm sure he did...
Mark Brittingham
Please fix in post!
Mark Canlas
Fixed. Whew... The dangers of a quick response...
Kluge
A: 

It's not a fair statement to say that "order by" is slow in and of itself. You have many RDBM's to consider as far as their own implementation, and the data type and indexing scheme. I would, however, doubt that you can sort it faster client-side than you can on the server, but that isn't to say that sorting it on the server is the right thing to do.

Adam Robinson
+1  A: 

If you are selecting few enough rows to display, it's not conceivable that the ORDER BY clause would take any perceptible amount of time unless you are limiting the number of rows returned with LIMIT or TOP.

We need more info. What dbms? What does the query plan look like? Have you looked at query plans with and without ORDER BY? What differences do you see?


EDIT:

SELECT A.NAME, B.SYNTAX, B.DESCRIPTION, A.RATE1, A.RATE2,
A.RATE3, A.STARTDATE, A.ENDDATE, A.HIDE, A.CATEGORYNAME
FROM A, B
WHERE A.MODULENAME='"+loadedModuleName+"'
AND A.NAME = B.NAME
AND (A.PREVIOUS<>'N' OR A.PREVIOUS IS NULL)
ORDER BY NAME

Is NAME the primary key? Is there an index on NAME? By itself, or with other fields? In what sequence?
How many rows are returned for one loadedModuleName?
I suspect slowness comes from "A.PREVIOUS <> 'N' OR A.PREVIOUS IS NULL" Try using (NOT A.PREVIOUS = 'N') which I think is equivalent and may help a bit.
Time the query with and without the ORDER BY and see if the timing is at all different. It shouldn't be.


EDIT:

If NAME is not unique in either A or B, your join is going to go partially ballistic when every A.NAME instance becomes cross-joined on B.NAME. If 50 A rows match and 50 B rows match, you'll end up with 2500 result rows, which may not be what you intend.

le dorfier
+5  A: 

Order by on an indexed field should not be slow as it can pull the data in index order. You might want to put up information about the database structure (the DDL) and the actual query so people can take a look.

You absolutely should use a SQL sort rather than a code sort where possible so you are on the right track there.

UPDATE: Ok, a few things. First, you should not use the " +loadedModuleName+" construct as it makes each query unique and screws up the optimizer. Use a parameter. Second, your Order by clause is ambiguous as to whether it is table A or B - make this explicit and choose the table with the Index (even if both have indices, make it explicit). Finally, your "Previous" field can still be indexed even as a char(1). I would do everything but the last suggested index, test speed and, if still slow, go for the index and check again.

UPDATE So you'll be returning <1000 records but what is the size of the table in total?

UPDATE Oh, man, I'm sorry I didn't catch this before. If you want to deploy it correctly on SQL Server, your query should be:

SELECT A.NAME, B.SYNTAX, B.DESCRIPTION, A.RATE1, A.RATE2, A.RATE3, A.STARTDATE, A.ENDDATE, A.HIDE, A.CATEGORYNAME 
FROM Table1 A join Table2 B on (A.Name=B.Name)
WHERE (A.MODULENAME=@ModuleName) AND (A.PREVIOUS<>'N' OR A.PREVIOUS IS NULL) 
ORDER BY A.NAME

Try this and I will almost guarantee you'll see a huge speed-up.

Mark Brittingham
I assume from the uppercase everything that this is oracle. In any case, "WHERE A.NAME = B.NAME" is equivalent, isn't it? Both ways return a cross-join if both A and B are not unique on name.
le dorfier
(Pls. see notes in my response, which dropped out of sight when I edited it.)
le dorfier
Good point RE: Oracle. I use upper case for keywords (except when I forget ;-) so I hadn't thought of that. No, I suggested the alternate SQL because I am pretty sure that, in SQL Server at least, including all of your query constraints in your join clause poses grave problems for the optimizer
Mark Brittingham
-continued- If you pull them outside the join then SQL Server will whittle down the set of rows before doing the join, thus resulting in a significant speed-up. *I could be wrong* so don't take this as gospel but I seem to remember this from an article in SQL Server magazine.
Mark Brittingham
I imagine that something similar may well be true of Oracle. I'd strongly suggest that Kapsh give it a try (it is equivalent and better formed anyhow). If I *am* wrong, it will be a learning experience for all of us!
Mark Brittingham
Yup - MySQL OTOH prefers everything packed into the JOIN expression. But I still think the most likely problem is a cartesian product on multiple rows in A joining with multiple rows in B on NAME. I'm suspicious about "I am expecting ..." vs. "I am getting ..."
le dorfier
*Very* good point le dorfier
Mark Brittingham
+1  A: 

If your filter looks like this:

WHERE col1 = @value1
      AND col2 = @value2
      AND col3 = @value3
ORDER BY
      col4

, then you'll need to create an index on (col1, col2, col3, col4).

The optimizer will use the index both to filter on the first three values and to order by the fourth one.

If you don't have such an index, then either of the following will happen:

  1. Optimizer will use an index on to filter on the WHERE condition, but it will still have to ORDER remaining rows.
  2. Optimizer will use an index to ORDER the values, but ALL values will need to be looked upon to filter them out.
  3. Optimizer will not use an index at all, so 1 and 2 will both be in effect.
Quassnoi
+1  A: 

ORDER BY is not particularly slow, especially if there is an index on that column. In particular, if you have a clustered index on that column, the data is already sorted.

You can also use paging (TOP or ROW_NUMBER) etc to help.

Marc Gravell
A: 

There are a lot of issues at play here.

In terms of pure performance and assuming things like indexes are set up correctly, databases are very good at sorting. For one query in isolation, sorting at the the database is probably fastest.

Unfortunately, in practice the database often becomes the bottleneck for an app. Anything you can do to move work away from the database will improve your overall application throughput. That includes moving sort operations to a less-busy business, web, or presentation tier. The presentation tier may not be able to sort one query as efficiently, but it might be better positioned to handle that overall load. This is especially true when you can reliably push that work all the way to individual end-user machines, though that can be problematic.

On the other hand, there's more to consider here than pure performance. You also want to think about future maintenance. What's more maintainable than a simple "ORDER BY" clause? That's only one line of code, compared to who knows how much extra programmer work to get a sort working elsewhere. This might be a case where you're better off throwing some money at the problem to make sure your database is simply able to maintain an acceptable level of performance, sorts and all.

Even here the issue isn't cut and dried. There's a school of thought that believes sorting really should be considered a function of the presentation tier anyway, and that ultimately the presentation tier is also the more maintainable place to do this work. I don't subscribe to that theory, but it is out there.

Joel Coehoorn
A: 

ORDER BY forces the RDBMS to sort.

Sorting requires resources that may not be present on your RDBMS server.

In some cases (i.e., single-table queries) you can write an ORDER BY that matches the indexes -- and if your RDBMS guarantees that tables are kept in an index order -- it might be zero cost. [A DB design which depends on a lot of single-table queries can be improved even more by discarding the RDBMS and using files.]

Generally, ORDER BY is going to have to sort.

"I figured that getting the database to do it for me is the most efficient."

That assumption is faulty. The database is not necessarily more efficient than your program outside the database.

S.Lott
A: 

Sorting on the client, is in my opinion something you should not do. Database engines are optimized for sorting data.

Like others said, if you can limit the number of rows you select, this will run faster.

Answering the following questions could help go further:

  • How many rows are returned by the query ?
  • How many columns are being selected ?
  • Do you join with any tables ?
  • How long does it take with / without the ORDER BY ?
Martin
+1  A: 

Keep in mind that many query editors will show results after just the first 50 or so have come back from the database.

Adding an ORDER BY will force it to wait on the database for all results, which will reveal the real speed of the query.

In those cases, the original query and the ORDERed one are the same speed; you were just fooled into thinking the first one was fast, because your editor was quick to get the top 50 or so rows.

JosephStyons
Ahh, the good old fat cursor. I think you hit the nail on the head with this answer. It would at least be a good test for him to make sure he times the query with and without the order by and makes sure he waits until the last row is returned before stopping the timer.
JohnFx
+1  A: 

UPDATE: As the query you posted, I think the best option is to consider the query as good because:

  • For few rows, don't care about who do the work. Then the easier for you is use the ORDER BY.
  • For a lot of rows, don't leave the client do the work: The RDMBS it's more specialized and sure the server have more memory and CPU.


The tips for orders you have to consider are:

  • ORDER BY is the ONLY way to warranty sort on a SQL query.
  • The best worker on sorting is the database in any case: BE SURE ON THIS!
  • Try to minimize the cardinality for returned rows.
  • Create indexes according the query. It means put the ordered columns last on the index.
  • Avoid indexing if the query is fast.
  • You can consider that the indexes are sorted, then if you sort for only a table and have good indexes the sort can have cost near zero.

For more rules of thumb about indexes look for this other SO question.

FerranB
A: 

There are a ton of really good suggestions here, but there is one minor thing I don't see that I'd like to comment on.

What database are you using? As someone who spends a ton of time on MySQL the thing that jumps out at me is the OR statement. MySQL can be really stupid with ORs. I've seen it be faster to do two selects and UNION them together.

If your row count is large (in the table, not returned) that could be a factor.

Otherwise I'd agree with the other posts. Indexes should make it fast, and it's often better to let the DB do it rather than handle it yourself. The DB knows what it's doing. Unless you have a REALLY big data set and want to shift the sorting burden to the client (so the DB can take more queries), I'd let the DB do the sorting work.

MBCook
A: 

I did some performance testing last night on a more production-type database (not the developer one) and here is what I found:

Total rows in table A: 13000

Total rows in table B: 5000

Rows returned by join query : 5000

Time taken if using with ORDER BY clause: ~5.422 seconds

Time taken if not using ORDER BY clause: ~5.345 seconds.

So it looked like the ORDER BY wasnt making much of a difference. (I am okay with the few milliseconds added).

I also tested by setting all B.SYNTAX values to NULL to make sure that it wasnt just the network latency with transferring so much data.

Now I removed the B.SYNTAX from the SELECT clause and the query took only 0.8 seconds!

So it seems that the whole CLOB column is the bottleneck. Which doesnt mean that I have gotten the solution to making this query faster, but at least I wont spend time writing a sorting algorithm.

Thanks to all who replied. I learned quite a bit and it led me to try a few different things out.

Kapsh