views: 390
answers: 3

What optimization techniques do you use on extremely large databases? If our estimations are correct, our application will have billions of records stored in the db (MS SQL Server 2005), mostly logs that will be used for statistics. The data contains numbers (mostly integer) and text (error message texts, URLs) alike.

I am interested in ANY kind of tips, hacks, solutions.

+7  A: 

The question is a little bit vague, but here are a few tips:

  • Use appropriate hardware for your databases. I'd opt for a 64-bit OS as well.
  • Have dedicated machines for the DBs. Use fast disks configured for optimal performance. The more disks you can span over, the better the performance.
  • Optimize the DB for the type of queries that will be performed. What happens more often, SELECTs or INSERTs?
  • Does the load happen throughout the entire day, or only for a few hours? Can you postpone some of the work to run at night?
  • Have incremental backups.
  • If you consider Oracle instead of SQL Server, you could use features such as Grid and table partitioning, which might boost performance considerably.
  • Consider having a load-balancing solution between the DB servers.
  • Pre-design the schemas and tables so queries will be performed as fast as possible. Consider the appropriate indexes as well (a sketch follows this list).
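
As a sketch of that last point, here is roughly what a log table designed around the expected query pattern might look like. The table and column names are invented for the example, not taken from the question:

    -- Hypothetical log table. The clustered index is on an ever-increasing key,
    -- so INSERTs append at the end of the table instead of causing page splits.
    CREATE TABLE dbo.EventLog
    (
        LogID       BIGINT IDENTITY(1,1) NOT NULL,
        LoggedAt    DATETIME       NOT NULL,
        Severity    TINYINT        NOT NULL,
        SourceID    INT            NOT NULL,
        MessageText NVARCHAR(2000) NULL,
        Url         VARCHAR(2048)  NULL,
        CONSTRAINT PK_EventLog PRIMARY KEY CLUSTERED (LogID)
    );

    -- Non-clustered index chosen for the statistics queries we expect,
    -- e.g. counts per source and severity over a date range.
    CREATE NONCLUSTERED INDEX IX_EventLog_LoggedAt
        ON dbo.EventLog (LoggedAt, SourceID, Severity);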

You're gonna have to be more specific about the way you're going to store those logs. Are they LOBs in the DB? Simple text records?

Moshe
SQL Server supports partitioning and shared-disk failover clustering (a similar topology to Oracle OPS/RAC/Grid, although only one node serves a given instance at a time). The partitioning support is more mature on Oracle, but SQL Server has offered partitioned views since 2000 and native table partitioning since 2005.
ConcernedOfTunbridgeWells
A: 

I don't use it myself, but I have read that Hadoop can be used in combination with HBase for distributed storage and distributed analysis of data such as logs.

tuinstoel
A: 

duncan's link has a good set of tips. Here are a few more:

If you do not need to query against totally up-to-date data (i.e. if data up to the last hour or close of business yesterday is acceptable), consider building a separate data mart for the analytics. This allows you to optimise it for fast analytic queries.
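
For example, a nightly job could roll the raw log up into a summary table in the mart, and the reporting queries would hit that instead of the live log. This is only a rough sketch; dbo.EventLog is the hypothetical table from the first answer, and the mart schema and column names are likewise made up:

    -- Hypothetical nightly rollup into a reporting mart.
    -- Reporting queries read mart.DailyErrorSummary instead of scanning the raw log.
    DECLARE @StartOfToday DATETIME, @StartOfYesterday DATETIME;
    SET @StartOfToday     = CONVERT(VARCHAR(10), GETDATE(), 112);   -- midnight today
    SET @StartOfYesterday = DATEADD(DAY, -1, @StartOfToday);

    INSERT INTO mart.DailyErrorSummary (LogDate, SourceID, Severity, EventCount)
    SELECT
        @StartOfYesterday AS LogDate,
        SourceID,
        Severity,
        COUNT(*)          AS EventCount
    FROM dbo.EventLog
    WHERE LoggedAt >= @StartOfYesterday
      AND LoggedAt <  @StartOfToday
    GROUP BY SourceID, Severity;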

The SQL Server query optimiser includes star-join optimisation for dimensional schemas. If the optimiser recognises this type of query, it can select the slice of data you want by filtering on the dimension tables before it touches the fact table, which reduces the amount of I/O needed for the query.
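
For illustration, the kind of query this helps has the shape below; the fact and dimension table names are invented for the example:

    -- Star-join shape: small, selective filters on the dimension tables are applied
    -- first, limiting how much of the large fact table has to be read.
    SELECT d.CalendarMonth, s.SourceName, COUNT(*) AS ErrorCount
    FROM fact.LogEvents AS f
    JOIN dim.Calendar   AS d ON d.DateKey   = f.DateKey
    JOIN dim.Source     AS s ON s.SourceKey = f.SourceKey
    WHERE d.CalendarYear = 2008
      AND s.SourceName   = 'web-frontend'
    GROUP BY d.CalendarMonth, s.SourceName;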

For VLDB applications involving large table scans, consider direct-attached storage with as many controllers as possible rather than a SAN; you can get more bandwidth more cheaply that way. However, if your data set is less than (say) 1 TB or so, it probably won't make a great deal of difference.

A 64-bit server with lots of RAM is good for caching if you have locality of reference in your query accesses. However, a table scan has no locality of reference, so once the table gets significantly bigger than the RAM on your server, extra memory doesn't help as much.

If you partition your fact tables, consider putting each partition on a separate disk array - or at least a separate SAS or SCSI channel if you have SAS arrays with port replication. Note that this will only make a difference if you routinely run queries across multiple partitions.
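
A minimal sketch of that arrangement in SQL Server 2005 follows. The partition function, scheme, and table names are hypothetical (fact.LogEvents is the same made-up fact table as in the earlier example), and the filegroups FG_2008_01 through FG_2008_04 are assumed to exist already, each on its own array or channel:

    -- Partition a fact table by month, mapping each partition to its own filegroup.
    CREATE PARTITION FUNCTION pfLogMonth (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2008-02-01', '2008-03-01', '2008-04-01');

    -- Three boundary values create four partitions, one per filegroup.
    CREATE PARTITION SCHEME psLogMonth
    AS PARTITION pfLogMonth TO (FG_2008_01, FG_2008_02, FG_2008_03, FG_2008_04);

    -- Place the fact table on the partition scheme, keyed on the log timestamp.
    CREATE TABLE fact.LogEvents
    (
        LoggedAt  DATETIME NOT NULL,
        DateKey   INT      NOT NULL,
        SourceKey INT      NOT NULL,
        Severity  TINYINT  NOT NULL
    )
    ON psLogMonth (LoggedAt);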

ConcernedOfTunbridgeWells