I am facing a performance issue (which could become a scaling issue later on). The application I am working on is quite complex and runs on SQL Server 2005; I need to join 6-7 tables to get the desired data, and each table contains more than 100,000 rows so far. The database schema cannot be changed (it must stay as is), so all I can do is optimize. Two approaches come to mind:

  • Don't join in the database; let the application server do the filtering using LINQ

    • Pros: can scale out easily by adding more app servers
    • Cons: more effort, and I'm not sure whether responsiveness will suffer
  • Leave the application server as is and optimize the SQL query as much as possible (more indexes, rebuilding indexes frequently, etc.)

    • Pros: minimum effort
    • Cons: the problem will come back as the tables grow

Basically caching isn't a solution for me at the moment (hardware issues, hosting issues, etc.), which is why I didn't bring it up originally. But I do know the benefits caching would bring and have used it many times.

+1  A: 

You need to examine what indexes are in place already, whether they (and statistics) are up to date, and whether new indexes would benefit your query workload.
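
For example (a sketch only; dbo.MyTable is a placeholder name, not from the question), SQL Server 2005 lets you list a table's indexes and check when the statistics behind each one were last updated:

    -- List the indexes on a table and when their statistics were last updated
    -- (dbo.MyTable is a placeholder - substitute your own table).
    SELECT i.name AS index_name,
           i.type_desc,
           STATS_DATE(i.object_id, i.index_id) AS stats_last_updated
    FROM   sys.indexes AS i
    WHERE  i.object_id = OBJECT_ID('dbo.MyTable');

    -- Refresh stale statistics (FULLSCAN is thorough but slower than sampling).
    UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;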

Mitch Wheat
+2  A: 

In general, I consider a number of points when talking about scale:

  1. How often does it get executed? For less frequently accessed queries you may be able to accept some degradation in performance.

  2. What's the rate of growth / change? If the records are relatively static in some of those tables, you may want to consider caching the contents externally in a dbm-type of file (or whatever the Windows equivalent is). There are also things like memcached that might be worth looking at. This may or may not be possible, though, and it's predicated on performing "joins" in application code.

  3. Profile. If you're joining on indexed columns (and you are, aren't you?), you won't necessarily degrade as the number of rows grows. This depends heavily on whether you're dealing with 1:1 or 1:N relationships, what the average size of N is, how much memory you have available on the database server, how often your table statistics are computed, and the types of columns and indexes. If you're dealing with a 1:1 relationship and it's unique, the database will be able to do a simple hash lookup.

Make sure you fetch absolutely no more columns than you need, especially when joining many tables: if the only columns needed to join two tables are indexed, the database may not touch the table at all and can perform the join using just the indexes. That reduces contention and improves performance for the less optimal queries that do need the table's actual contents, because fewer queries are pulling on the table.
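
To illustrate (a hedged sketch with made-up Orders/Customers tables, not anything from the question): SQL Server 2005 lets you attach non-key columns to an index with INCLUDE, so a query touching only those columns never has to read the table pages:

    -- Covering index: CustomerID is the search/join key, OrderDate rides
    -- along as a non-key column (INCLUDE is available from SQL Server 2005).
    CREATE INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate);

    -- This query can be answered entirely from the index; the optimizer
    -- never needs to touch the Orders table itself.
    SELECT o.OrderDate
    FROM   dbo.Orders AS o
    WHERE  o.CustomerID = 42;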

All relational databases have a tool or feature to view the query execution plan for a given query. Use it; if the output doesn't make sense to you, learn to read it. This is your primary window into what the database will do with a given query: which indexes will be used, the estimated (or actual) number of rows encountered at each step of execution, and other useful details.
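
In SQL Server specifically (a sketch; the SELECT reuses the hypothetical tables from above and stands in for your own join), the plan can be inspected from any query window:

    -- Return the estimated plan as XML instead of executing the statement.
    -- SET SHOWPLAN_XML must be the only statement in its batch, hence the GOs.
    SET SHOWPLAN_XML ON;
    GO
    SELECT o.OrderDate
    FROM   dbo.Orders    AS o
    JOIN   dbo.Customers AS c ON c.CustomerID = o.CustomerID;
    GO
    SET SHOWPLAN_XML OFF;
    GO

    -- Or execute the query and get actual per-operator row counts:
    SET STATISTICS PROFILE ON;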

Once you have information about what the query optimizer is actually doing with the query, and your indexes, statistics, and column selection are all straight, you'll have a better idea of where to go from there. If you've done all you can in the database, you'll have to look at caching the data and at going after fewer tables with a more specific / better WHERE clause.

Disclaimer: I have no direct experience with SQL Server, but I have a lot of experience on other RDBMSs (Oracle, MySQL, PostgreSQL, etc.) and architecture in general.

Eric Sammer
+1  A: 

Adding more servers in the "not to join" scenario will give you a bigger performance boost than trying to optimize the joins. And you're right - the problem will come back when you have more data.

The best solution is memory caching. You can cache table-to-table relationships, which are mostly small in size, rather than fetching them every time.

The optimum is to minimize joins, minimize selects, and then cache rarely-changed data in memory. That will give a boost.

As for Microsoft's (and other DB vendors') recommendation about joins: use them as optimally as possible. In my experience, 2-3 joins is the practical upper limit for complex selects.

Mr.ElectroNick
+3  A: 

Generally speaking, do the joining in the DBMS. If you do it in the application server, you are betting that you can do a better job of optimizing joins than the people who wrote the DBMS, and (further) that you can out-perform their best efforts by enough to offset the cost of transferring the unjoined data across the wire.

Now, if you are going to do a cross-product of two wide tables (let's say they are T1, with N1 rows of width W1, and T2, with N2 rows of width W2) with no filtering, then the DBMS is obliged to create and send N1 * N2 * (W1 + W2) bytes of data over the wire, whereas you could suck down the tables separately as N1 * W1 + N2 * W2 bytes of data. If N1 = N2 = 1M and W1 = W2 = 100, then that's 200 TB vs 200 MB of data transfer in favour of doing the cross-product in the app server. But that's not exactly fair to the DBMS. Most queries are not that silly - they join on columns and apply conditions, and the DBMS optimizer will struggle mightily (and automatically) to minimize the work done. Further, it only sends the pertinent data back to you; it does not have to send all the rows that don't match your criteria.

To show an alternative scenario (in favour of the DBMS), consider a case where T1 has N1 = 1M rows of width W1 = 100, but T2 has N2 = 100K rows of width W2 = 50. There is a join between the two tables on an integer column, so there are 10 rows in T1 for each one in T2. Suppose that you suck down all of T1 and T2 to the app server: that requires N1 * W1 + N2 * W2 = 105 MB of data. But the filter conditions limit the data to 1/10 of the rows in T2, and for each surviving row in T2, only 2 of its 10 matching rows in T1 also pass the filter conditions. Now the DBMS is only going to transfer N2 * (W1 + W2) / 5 = 3 MB, a saving of over 100 MB of data transfer in favour of the DBMS. And if you manage to be clever and download just the N2 * W2 / 10 = 500 KB of data that corresponds to the filtered values in T2, you still have to get the DBMS to do a 'semi-join' of T1 against those values to get the right T1 rows to the app server. If you only need a subset of the columns, there can be another set of savings. And DBMSs tend to have rather clever sort packages; you'll need a good sort package in your app server to present data in the correct order.
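
A sketch of that 'semi-join' in SQL (T1 and T2 are the tables from the example; the column names are invented for illustration): EXISTS returns the matching T1 rows without shipping any T2 columns at all:

    -- Semi-join: fetch only the T1 rows that have a qualifying match in T2,
    -- without transferring any T2 columns over the wire.
    -- (t2_id, id and status are invented column names for illustration.)
    SELECT t1.*
    FROM   T1 AS t1
    WHERE  EXISTS (SELECT 1
                   FROM   T2 AS t2
                   WHERE  t2.id     = t1.t2_id
                     AND  t2.status = 'wanted');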

It should normally be a hands-down win for joins in the DBMS. If it isn't, it is because you are asking the server to do more work than it can handle. In that case, you need to look at whether replicating the database server makes sense, or whether adding more cores, or more network bandwidth, or more main memory will do the job.

Jonathan Leffler
+1  A: 

You mention that each table has "more than 100,000 rows", but you don't mention how much data you're selecting or how complex the join is. 100K rows is not large for a correctly set-up and indexed SQL Server. We've got 17-way joins that return results in a few milliseconds, but they're well indexed and select few rows. I would look at the profiling information in SQL Server before starting to redesign your application.
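
For instance (a sketch; the query is a placeholder for the join under investigation), SQL Server will report per-table I/O and timing for any statement you run:

    -- Per-table logical/physical reads and CPU/elapsed time for the
    -- statements that follow (output appears on the Messages tab in SSMS).
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Placeholder for the multi-table join under investigation:
    SELECT COUNT(*) FROM dbo.MyTable;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;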

WaldenL
A: 

Just add more RAM. A database that fits entirely in RAM forgives a lot of mistakes.

Seun Osewa
A: 

Don't neglect the overhead of transferring the data among servers. Ethernet degrades pretty rapidly under load (I think the sustained transfer rate is something like 30% of the single-packet rate; i.e., your 100Mb/sec link will really only do about 30Mb/sec of heavy traffic). Once you saturate the link at the DB server, adding more app servers won't matter, because you won't be able to get the data out any faster.

Joins on an app server also put you at the mercy of the slowest one. We saw performance tank at a client site, and discovered that the primary app server had crashed, and the client's recovery strategy was to have the machine fail over to a virtual machine running on one of the other servers. Kind of a neat solution, but certainly not as performant. I've also seen slowdowns when routers fail and suddenly all your peer servers are three or four hops away instead of being on the same subnet.

TMN