Our company is developing an internal project to parse text files. Those text files contain metadata, which is extracted using regular expressions. Ten computers parse the text files 24/7 and feed the extracted metadata into a SQL Server 2005 database running on a high-end Intel Xeon machine.
The simplified database schema looks like this:
Items

| Id | Name   |
|----|--------|
| 1  | Sample |

Items_Attributes

| ItemId | AttributeId |
|--------|-------------|
| 1      | 1           |
| 1      | 2           |

Attributes

| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1  | 1               | 500mB |
| 2  | 2               | 1.0.0 |

AttributeTypes

| Id | Name    |
|----|---------|
| 1  | Size    |
| 2  | Version |
There are many distinct text file types, each with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute. Items_Attributes lets us avoid storing duplicate Attribute values, which keeps the database size from growing roughly tenfold.
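To make the deduplication concrete, here is a minimal sketch of the schema and of how an Attribute value could be reused across Items. It runs on SQLite through Python's `sqlite3` module rather than on SQL Server 2005. The `UNIQUE` constraint and the helpers `get_or_create` and `add_item` are assumptions made for illustration, not our actual loader code.

```python
import sqlite3

# Simplified schema from the tables above. The UNIQUE constraint on
# (AttributeTypeId, Value) is an assumption that enforces deduplication.
SCHEMA = """
CREATE TABLE AttributeTypes (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Attributes (
    Id INTEGER PRIMARY KEY,
    AttributeTypeId INTEGER NOT NULL REFERENCES AttributeTypes(Id),
    Value TEXT NOT NULL,
    UNIQUE (AttributeTypeId, Value)
);
CREATE TABLE Items (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Items_Attributes (
    ItemId INTEGER NOT NULL REFERENCES Items(Id),
    AttributeId INTEGER NOT NULL REFERENCES Attributes(Id),
    PRIMARY KEY (ItemId, AttributeId)
);
"""


def get_or_create(conn, select_sql, insert_sql, params):
    """Return the Id of an existing row, or insert the row and return its new Id."""
    row = conn.execute(select_sql, params).fetchone()
    if row:
        return row[0]
    return conn.execute(insert_sql, params).lastrowid


def add_item(conn, name, metadata):
    """Hypothetical loader: store one Item and link it to shared Attribute rows."""
    item_id = conn.execute("INSERT INTO Items (Name) VALUES (?)", (name,)).lastrowid
    for type_name, value in metadata.items():
        type_id = get_or_create(
            conn,
            "SELECT Id FROM AttributeTypes WHERE Name = ?",
            "INSERT INTO AttributeTypes (Name) VALUES (?)",
            (type_name,),
        )
        attr_id = get_or_create(
            conn,
            "SELECT Id FROM Attributes WHERE AttributeTypeId = ? AND Value = ?",
            "INSERT INTO Attributes (AttributeTypeId, Value) VALUES (?, ?)",
            (type_id, value),
        )
        conn.execute(
            "INSERT INTO Items_Attributes (ItemId, AttributeId) VALUES (?, ?)",
            (item_id, attr_id),
        )
    return item_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_item(conn, "Sample", {"Size": "500mB", "Version": "1.0.0"})
add_item(conn, "Other", {"Size": "500mB", "Version": "2.0.0"})

attribute_count = conn.execute("SELECT COUNT(*) FROM Attributes").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM Items_Attributes").fetchone()[0]
print(attribute_count, link_count)  # the shared "500mB" value is stored only once
```

Both items share the single `500mB` row in Attributes; only the narrow Items_Attributes link table grows with every parsed file.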
This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from new processed files no matter which internal structure they have.
Additionally, this allows us to filter the data and to obtain dynamic reports based on user criteria. We filter by Attribute and then pivot the result set (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query

    SELECT FROM Items WHERE Size = @A AND Version = @B

would return a pivoted table like this:

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |
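For readers who want to see what the pseudo-SQL expands to, here is a rough sketch of a filter-and-pivot query over the simplified schema. It is not our production query: it runs on SQLite through Python's `sqlite3` module, and it uses conditional aggregation (`MAX(CASE ...)`) in place of SQL Server's `PIVOT` operator, since SQLite has no `PIVOT`. It also assumes each item has at most one value per attribute type. The second item, `Other`, is made-up sample data added to show the filter working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AttributeTypes (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Attributes (Id INTEGER PRIMARY KEY, AttributeTypeId INTEGER, Value TEXT);
CREATE TABLE Items (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Items_Attributes (ItemId INTEGER, AttributeId INTEGER);

-- Sample rows from the schema tables, plus one made-up item ("Other").
INSERT INTO AttributeTypes VALUES (1, 'Size'), (2, 'Version');
INSERT INTO Attributes VALUES (1, 1, '500mB'), (2, 2, '1.0.0'), (3, 2, '2.0.0');
INSERT INTO Items VALUES (1, 'Sample'), (2, 'Other');
INSERT INTO Items_Attributes VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")

# One output column per attribute type; HAVING applies the report filter
# to the pivoted values. Assumes at most one value per type per item.
PIVOT_QUERY = """
SELECT i.Name AS ItemName,
       MAX(CASE WHEN t.Name = 'Size'    THEN a.Value END) AS Size,
       MAX(CASE WHEN t.Name = 'Version' THEN a.Value END) AS Version
FROM Items AS i
JOIN Items_Attributes AS ia ON ia.ItemId = i.Id
JOIN Attributes       AS a  ON a.Id = ia.AttributeId
JOIN AttributeTypes   AS t  ON t.Id = a.AttributeTypeId
GROUP BY i.Id, i.Name
HAVING MAX(CASE WHEN t.Name = 'Size'    THEN a.Value END) = :size
   AND MAX(CASE WHEN t.Name = 'Version' THEN a.Value END) = :version
"""

rows = conn.execute(PIVOT_QUERY, {"size": "500mB", "version": "1.0.0"}).fetchall()
print(rows)  # [('Sample', '500mB', '1.0.0')]
```

Every filtered attribute adds joins and aggregation over Items_Attributes, which is the table that keeps growing, so this query shape is where the report time goes.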
The application has been running for months, and performance has degraded to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week.
Everything is properly indexed, and we have spent considerable time analyzing and optimizing query execution plans.
So my question is, how would you scale this in order to decrease report execution times?
We came up with these possible solutions:
- Buy more hardware and set up a SQL Server cluster. (we need advice on the proper "clustering" strategy)
- Use a key/value database like HBase (we don't really know if it would solve our problem)
- Use an ODBMS rather than an RDBMS (we have been considering db4o)
- Move our software to the cloud (we have zero experience)
- Statically generate reports at runtime. (we don't really want to)
- Static indexed views for common reports (performance is almost the same)
- De-normalize the schema (some of our reports involve up to 50 tables in a single query)