I am currently working on a performance tuning exercise. The application is DB-intensive with very little processing logic. The tuning is focused on the way DB calls are made and on the DB itself.

We tuned the queries, added the missing indexes, and reduced or eliminated DB calls where possible. The application is performing very well and all is fine.

With smaller data volumes (say up to 100,000 records), the performance is fantastic. My question is: what needs to be done to ensure such good performance at higher data volumes? The data volumes are expected to reach 10 million records.

I can think of table and index partitioning, filesystems optimized for DB storage, and periodic archiving to keep the number of rows in check. I would like to know what else could be done. Any tips/strategies/patterns would be very helpful.
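
For reference, the partitioning I have in mind is Oracle-style range partitioning on a date column, which would also turn the periodic archiving into a partition drop (table and column names here are invented for illustration):

    CREATE TABLE orders (
        order_id    NUMBER        NOT NULL,
        customer_id NUMBER        NOT NULL,
        created_at  DATE          NOT NULL,
        amount      NUMBER(10,2)
    )
    PARTITION BY RANGE (created_at) (
        PARTITION p_2008 VALUES LESS THAN (TO_DATE('2009-01-01', 'YYYY-MM-DD')),
        PARTITION p_2009 VALUES LESS THAN (TO_DATE('2010-01-01', 'YYYY-MM-DD')),
        PARTITION p_max  VALUES LESS THAN (MAXVALUE)
    );

    -- a LOCAL index is partitioned along with the table
    CREATE INDEX orders_created_ix ON orders (created_at) LOCAL;

    -- archiving an old year then becomes a quick metadata operation
    ALTER TABLE orders DROP PARTITION p_2008;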

+1  A: 

You are on the right track:
1) Proper indexes
2) DBMS option tuning (memory caches, buffers, internal thread control and so on)
3) Query tuning (in particular, log slow queries and then tune/rewrite them)
4) To tune your queries and indexes you may need to examine their execution plans (see the sketch after this list)
5) Powerful dedicated server
6) Think about the queries your client applications send to the database. Are they always necessary? Do you need all the data you ask for? Is it possible to cache some data?
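
For point 4, if this is Oracle, a minimal way to look at a plan is the EXPLAIN PLAN / DBMS_XPLAN pair (the table and bind variable below are placeholders):

    EXPLAIN PLAN FOR
    SELECT o.order_id, o.amount
    FROM   orders o
    WHERE  o.customer_id = :cust_id;

    -- show the plan that was just explained
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);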

nightcoder
+1  A: 

Different databases need to be tuned in different ways. What RDBMS are you using?

Also, how do you know whether or not what you have done so far will result in poor performance with larger data sets? Have you tested your current optimisations with a large amount of test data?

When you did this, how did the performance change? If you are able to tune the database so that it performs with the data it has now, there's no reason to think that your methods won't work with a larger data set.

Depending on the RDBMS, the next type of solution is simple: get bigger, beefier hardware. More RAM, more disks, more CPUs.

RibaldEddie
We are using Oracle 10gR2. I am working on a roadmap so that, as the data volume increases, a well-defined set of steps will ensure good performance.
Sathya
Sounds like you are in good shape and don't need our help then! :)
RibaldEddie
+4  A: 

Monitoring. Use some tools to monitor performance, and saturation of CPU, memory, and I/O. Make trend lines so you know where your next bottleneck will be before you get there.
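
On Oracle, for example, you can sample a few cumulative counters from V$SYSSTAT on a schedule and chart them over time; the view and statistic names below are standard, but the logging table is hypothetical:

    -- hypothetical table that accumulates samples for trend lines
    CREATE TABLE perf_samples (
        sampled_at DATE,
        stat_name  VARCHAR2(64),
        stat_value NUMBER
    );

    -- run this from a scheduled job, then chart the deltas over time
    INSERT INTO perf_samples (sampled_at, stat_name, stat_value)
    SELECT SYSDATE, name, value
    FROM   v$sysstat
    WHERE  name IN ('physical reads', 'physical writes',
                    'user calls', 'parse count (hard)');
    COMMIT;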

Testing. Create mock data so you have 10 million rows on a testing server today. Benchmark the queries you have in your application and see how well they perform as the volume of data increases. You might be surprised at what breaks down first, or it may go exactly as predicted. The point is that you can find out.
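
On Oracle, one way to fabricate that volume is a row-generator cross join; a single CONNECT BY LEVEL over 10 million rows tends to be memory-hungry, so two smaller generators are multiplied instead (the target table and the value spreads are placeholders):

    INSERT /*+ APPEND */ INTO orders_test (order_id, customer_id, created_at, amount)
    SELECT ROWNUM,
           MOD(ROWNUM, 50000),                   -- spread over 50,000 customers
           SYSDATE - MOD(ROWNUM, 365),           -- spread over one year of dates
           ROUND(DBMS_RANDOM.VALUE(1, 1000), 2)  -- random amounts
    FROM   (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 10000),
           (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 1000);  -- 10,000 x 1,000 = 10 million rows
    COMMIT;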

Maintenance. Make sure your application and infrastructure support some downtime, because that's always necessary. You might have to defrag and rebuild your indexes. You might have to refactor some of the table structure. You might have to upgrade the server software or apply patches. To do this without interrupting continuous operation, you'll need some redundancy built into the design.
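
On Oracle, for instance, some of that can be done with limited interruption (index and table names are placeholders; online rebuild is an Enterprise Edition feature):

    -- rebuild a fragmented index without taking it offline
    ALTER INDEX orders_created_ix REBUILD ONLINE;

    -- refresh optimizer statistics after large data changes
    BEGIN
        DBMS_STATS.GATHER_TABLE_STATS(
            ownname => USER,
            tabname => 'ORDERS',
            cascade => TRUE);  -- gather index statistics as well
    END;
    /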

Research. Find the best journals and blogs for the database brand you're using, and read them (e.g. http://www.mysqlperformanceblog.com if you use MySQL). You can ask good questions like the one you ask here, but also read what other people are asking, and what they're being advised to do about it. You can learn solutions to problems that you don't even have yet, so that when you have them, you'll have some strategies to employ.

Bill Karwin
The issue I see with mock data is the diversity of values, especially in the indexed columns. Unless it reflects the way real-life data would be distributed, I feel the results can't be taken seriously. But yes, we would be creating the mock data anyway.
Sathya
You can already calculate the distribution of the real data you currently have, and then generate test data with the same distribution characteristics. You mention the diversity of values in indexed columns; I don't see why you can't calculate the current distribution and then generate test data with the same pattern.
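
For instance, something like this gives the current distribution of an indexed column to mimic (table and column names are placeholders):

    SELECT status,
           COUNT(*) AS cnt,
           ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 2) AS pct
    FROM   orders
    GROUP  BY status
    ORDER  BY cnt DESC;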
RibaldEddie
Right, you don't have to generate absolutely random test data. Though it's true that your current small data set can't predict perfectly what the data diversity will be when you have millions of rows, I don't agree that this means "the results can't be taken seriously." Realistically, a test with synthetic data is accurate enough to help you tune your SQL queries.
Bill Karwin
A: 

10 million records is probably too small to bother with partitioning. Typically partitioning only becomes interesting when your data volumes are an order of magnitude or so bigger than that.

Index tuning for a database with 100,000 rows will probably get you 99% of what you need with 10 million rows. Keep an eye out for table scans or index range scans on the large tables in the system. On smaller tables they are fine and in some cases even optimal.
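
On Oracle you can also spot these across the whole cached workload rather than one statement at a time; the row-count threshold below is arbitrary:

    -- which statements currently do full scans against big tables?
    SELECT p.object_owner, p.object_name, COUNT(DISTINCT p.sql_id) AS statements
    FROM   v$sql_plan p
           JOIN dba_tables t
             ON  t.owner      = p.object_owner
             AND t.table_name = p.object_name
    WHERE  p.operation = 'TABLE ACCESS'
    AND    p.options   = 'FULL'
    AND    t.num_rows  > 1000000
    GROUP  BY p.object_owner, p.object_name
    ORDER  BY statements DESC;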

Archiving old data may help but this is probably overkill for 10 million rows.

One possible optimisation is to move reporting onto a separate server. This will reduce the load on the operational server - reports are often quite anti-social when run on operational systems, as the schema tends not to be well optimised for them.

You can use database replication to do this or build a data mart for reporting. Replication is easier to implement, but the reports will be no more efficient than they were on the production system. Building a star-schema data mart will be more efficient for reporting but incurs additional development work.
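
As a lighter-weight middle ground on Oracle, a materialized view can push the heavy reporting aggregation off the operational tables; the names and refresh schedule below are arbitrary:

    CREATE MATERIALIZED VIEW daily_sales_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE
    START WITH SYSDATE NEXT TRUNC(SYSDATE) + 1 + 2/24  -- refresh nightly at 02:00
    AS
    SELECT TRUNC(created_at) AS sale_day,
           customer_id,
           SUM(amount)       AS total_amount,
           COUNT(*)          AS order_count
    FROM   orders
    GROUP  BY TRUNC(created_at), customer_id;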

ConcernedOfTunbridgeWells
Thanks for the pointers. The DB size is expected to grow by 200 GB every year, so I may have been wrong in saying 10 million rows.
Sathya