views:

171

answers:

3

Hello there!

I have a database in SQL Server 2008 that supports various applications. I am now starting the development of a new application and, from that database, I require data that fits in one table (though there is some replication).

The data I way may be extracted to a view (a few inner joins) with 6 columns. I'm developing a web application that must be really fast navigating through the data (about 6 million records). Therefore, I would like to know if it is advisable to query the SQL Server view or to deploy a new, updatable, database indexed by Lucene or similar mechanism (BerkeleyDB?)

Thank in advance!

A: 

Are you familiar with the Boyce-Codd Normal Forms?

If you want to keep your current infrastructure, want to constrain your data to a single table, and are concerned with performance, I highly recommend you add additional columns to the table and store de-normalized data in those columns.

For example:

Say you had a table, people. Each person has a manager_id that links to another person in the same table as their manager. You could add a column for the manager called employees_count that stored a counter that tracked how many people were under that manager.

This is a simple but effective example of de-normalized data

Tricon
+1  A: 

views don't make accessing your data any faster, they just make writing queries easier. this means that selecting data from your view will be the equivalent of doing the few joins that comprise the view, and a join on a table involving 6 million records will take awhile.

the question is, what do you want to speed up? if you want fast selects from your complex data set, you could try to build real tables (not views) containing the data you want and update those tables from the main table at preset intervals.

if your data is going to be constantly modified and you need fast updates and for selects to always reflect the latest data, you might want to optimize your table structure. see if you can do away with the joins. this might come at the cost of duplicating your data, which means you'll need heavy use of transactions.

Igor
Well... the data is updated monthly... and I only need to make selects...So you think I should create a new set of tables (2 or 3) with the required data and update them when the main database is updated?
Pedro Lopes
In general correct however see leonm's comment on indexed voews
Mark
@Pedrolopes, splurge and do 5 or 6! how on earth from your question can anyone recommend how many tables you should have? you give zero details
KM
A: 

It is probably a difficult question to answer without knowledge of the exact data structure. Indexed views in SQL Server is good for this and could very well meet your performance criteria.

At this point it is advisable to run a test (Perhaps a JMeter jdbc load test) against a view. You could determine the viability of the simplest solution with relatively little effort before going down the route for creating additional indexes with Lucene or BerkeleyDB.

leonm
It is a overnormalized data structure... It has 6 tables so for even the smallest operations I need to make 2 or 3 joins... And it has about 30 million entry and I just want a subset (approx. 6 million) of those..So, create a new set of tables and test its performance? SQL Server Analysis Services have given me some great results...
Pedro Lopes
I would go with new tables/views initially and only switch to another approach if it is really necessary. Do lots of testing to see if the performance meets your needs but it sounds if you're ok.Keeping an additional Lucene index has its own set of problems. The simpler you can keep things, the more likely it is to scale properly.
leonm
Pedrolopes that is hardly an over normalized structure. I often have to query much larger tables with far more joins than that and still get good performance. There is nothing inherently wrong with joining to 2 or three tables if you have indexed correctly and 6 million records is a tiny dataset for SQL Server. You should easily get good performance using standard queries. Views can hurt performance especially if you pile views on top of views. Indexed views can help but you can only index a view that doesn't reference another view. Think long and hard before you go down that path!
HLGEM