Our company is developing an internal project to parse text files. Those text files contain metadata, which is extracted using regular expressions. Ten computers parse the text files 24/7 and feed the extracted metadata into a SQL Server 2005 database running on a high-end Intel Xeon server.

The simplified database schema looks like this:

Items

| Id | Name   |
|----|--------|
| 1  | Sample |

Items_Attributes

| ItemId | AttributeId |
|--------|-------------|
| 1      | 1           |
| 1      | 2           |

Attributes

| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1  | 1               | 500mB |
| 2  | 2               | 1.0.0 |

AttributeTypes

| Id | Name    |
|----|---------|
| 1  | Size    |
| 2  | Version |

There are many distinct text file types, each with distinct metadata inside. For every text file we have an Item, and for every extracted metadata value we have an Attribute.

Items_Attributes allows us to avoid duplicate Attribute values, which keeps duplicates from blowing up the database size.

This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter what internal structure they have.

Additionally, this allows us to filter the data and to produce dynamic reports based on user criteria. We filter by Attribute and then pivot the result set (http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query

SELECT FROM Items WHERE Size = @A AND Version = @B

would return a pivoted table like this

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |
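
For reference, against the schema above that pseudo-query effectively has to be written as a pivot over self-joined attribute rows. A minimal sketch of one way to express it in T-SQL, assuming the table and column names shown (@A and @B stand for the report parameters from the pseudo-query):

    -- Sketch only: pivots attribute rows into columns via conditional aggregation.
    SELECT  i.Name AS ItemName,
            MAX(CASE WHEN att.Name = 'Size'    THEN a.Value END) AS Size,
            MAX(CASE WHEN att.Name = 'Version' THEN a.Value END) AS Version
    FROM    Items i
    JOIN    Items_Attributes ia ON ia.ItemId = i.Id
    JOIN    Attributes a        ON a.Id = ia.AttributeId
    JOIN    AttributeTypes att  ON att.Id = a.AttributeTypeId
    GROUP BY i.Id, i.Name
    HAVING  MAX(CASE WHEN att.Name = 'Size'    THEN a.Value END) = @A
        AND MAX(CASE WHEN att.Name = 'Version' THEN a.Value END) = @B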

The application has been running for months, and performance has degraded terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the Items_Attributes table grows by an average of 10,000,000 rows per week. Everything is properly indexed, and we have spent considerable time analyzing and optimizing query execution plans.

So my question is, how would you scale this in order to decrease report execution times?

We came up with these possible solutions:

  • Buy more hardware and set up a SQL Server cluster (we need advice on the proper "clustering" strategy).
  • Use a key/value database like HBase (we don't really know whether it would solve our problem).
  • Use an ODBMS rather than an RDBMS (we have been considering db4o).
  • Move our software to the cloud (we have zero experience with this).
  • Statically generate reports at runtime (we don't really want to).
  • Static indexed views for common reports (performance is almost the same).
  • De-normalize the schema (some of our reports involve up to 50 tables in a single query).
+2  A: 

I'd start by posting the exact table metadata (along with indexing details), the exact query text and the execution plan.

With your current table layout, a query similar to this:

SELECT FROM Items WHERE Size = @A AND Version = @B

cannot benefit from using a composite index on (Size, Version), since it's impossible to build such an index.

You cannot even build an indexed view, since it would contain a self-join on attributes.

Probably the best decision would be to denormalize the table like this:

id  name  size  version

and create an index on (size, version)
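
A minimal sketch of what that denormalized table and composite index might look like (the table and column names here are illustrative assumptions, not the poster's actual schema):

    -- Illustrative only: flattened reporting table with a composite index.
    CREATE TABLE ItemsFlat
    (
        Id      INT           NOT NULL PRIMARY KEY,
        Name    NVARCHAR(255) NOT NULL,
        Size    NVARCHAR(50)  NULL,
        Version NVARCHAR(50)  NULL
    );

    CREATE INDEX IX_ItemsFlat_Size_Version ON ItemsFlat (Size, Version);

    -- The report then becomes a simple index seek:
    -- SELECT Name, Size, Version FROM ItemsFlat WHERE Size = @A AND Version = @B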

Quassnoi
It's probably best to take the parameters, find out which ids you need from the lookup tables, and then use those ids in the query.
SQLMenace
@SQLMenace: if 1,000,000 rows satisfy Size, 1,000,000 rows satisfy Version and 1,000 rows satisfy both, the composite index would require 1,000 scans, while scanning the lookup tables twice would require 2,000,000 scans.
Quassnoi
That's right; the more filtering criteria you add, the longer it takes to execute.
knoopx
@knoopx: the impossibility of creating composite indexes is the main drawback of such a design. If you need multiple search criteria, you should keep your attributes as table columns.
Quassnoi
If the values table is pretty narrow (2 * 4 bytes for the ids and another 20 bytes for the value) and you have an index on the value, it might be doable, since you could store a million rows in a little under 3,500 pages; but then again, I don't know what the data looks like.
SQLMenace
@SQLMenace: The schema and data look exactly like the ones in this question. Everything is properly indexed. There are a few more non-participating columns and hundreds of surrounding tables with no relationships. I know that rebuilding the tables would obviously gain immeasurable benefits from the indexes, but dynamic report generation is a functional requirement.
knoopx
We are evaluating the possibilities; if none of them works for us, we will be forced to denormalize or rebuild the tables in a column-based fashion.
knoopx
+2  A: 

Perhaps this white paper by the SQL Server CAT team on the pitfalls of the Entity-Attribute-Value database model can help: http://sqlcat.com/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx

Remus Rusanu
This white paper almost directly answers the poster's question with best practice advice.
Philip Rieck
A: 

A short-term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.

That's assuming you are not trying to report across all ItemIds at once.
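
A rough sketch of how that could be set up in SQL Server 2005 (the partitioning column, boundary values and filegroup names are assumptions for illustration):

    -- Illustrative only: partition Items_Attributes by ItemId ranges across filegroups.
    CREATE PARTITION FUNCTION pf_ItemsAttributes (INT)
        AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000);

    CREATE PARTITION SCHEME ps_ItemsAttributes
        AS PARTITION pf_ItemsAttributes TO (FG1, FG2, FG3, FG4);

    -- Then rebuild the clustered index of Items_Attributes on the partition scheme:
    -- CREATE CLUSTERED INDEX IX_Items_Attributes
    --     ON Items_Attributes (ItemId, AttributeId)
    --     ON ps_ItemsAttributes (ItemId);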

RedFilter
"That's assuming you are not trying to report across all ItemIds at once.", in fact we group by attribute value and sort by count of repeated values, so I guess this won't work?
knoopx
+1  A: 

It looks to me like you are issuing OLAP queries against a database optimized for OLTP transactions. Without knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are doing. That would involve aggregating data (if possible), denormalizing, and working against a database that is a day old or so. You would incrementally update the data each day, or at whatever interval you wish.
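
A minimal sketch of such a periodic refresh into a flattened reporting table (the table and column names are assumptions; a real warehouse load would more likely be incremental):

    -- Illustrative only: scheduled full rebuild of a denormalized reporting table.
    TRUNCATE TABLE Report_Items;

    INSERT INTO Report_Items (ItemId, ItemName, Size, Version)
    SELECT  i.Id,
            i.Name,
            MAX(CASE WHEN att.Name = 'Size'    THEN a.Value END),
            MAX(CASE WHEN att.Name = 'Version' THEN a.Value END)
    FROM    Items i
    JOIN    Items_Attributes ia ON ia.ItemId = i.Id
    JOIN    Attributes a        ON a.Id = ia.AttributeId
    JOIN    AttributeTypes att  ON att.Id = a.AttributeTypeId
    GROUP BY i.Id, i.Name;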

MicSim
+1  A: 

Please post the exact DDL and indexes; if you have indexes on the ID columns then your query will result in a scan.

Instead of something like this:

SELECT FROM Items WHERE Size = @A AND Version = @B

you need to do this:

SELECT FROM Items WHERE ID = 1

In other words, you need to grab the text values, find the ids that you are indexing on, and then use those ids in your query to return the results.
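
A sketch of that two-step approach against the schema in the question (variable names are illustrative; @A and @B are the text values being searched for):

    -- Illustrative only: resolve the attribute ids first, then filter on the ids.
    DECLARE @SizeAttrId INT, @VersionAttrId INT;

    SELECT @SizeAttrId = a.Id
    FROM   Attributes a
    JOIN   AttributeTypes att ON att.Id = a.AttributeTypeId
    WHERE  att.Name = 'Size' AND a.Value = @A;

    SELECT @VersionAttrId = a.Id
    FROM   Attributes a
    JOIN   AttributeTypes att ON att.Id = a.AttributeTypeId
    WHERE  att.Name = 'Version' AND a.Value = @B;

    SELECT i.Id, i.Name
    FROM   Items i
    JOIN   Items_Attributes ia1 ON ia1.ItemId = i.Id AND ia1.AttributeId = @SizeAttrId
    JOIN   Items_Attributes ia2 ON ia2.ItemId = i.Id AND ia2.AttributeId = @VersionAttrId;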

It is probably also a good idea to look at a partition function to distribute your data.

Clustering is done for availability, not performance: if one node (the active node) dies, the other node (the passive node) becomes active. Of course, there is also active/active clustering, but that is another story.

SQLMenace
+2  A: 

I have worked with such schemas many times. They never perform well. The best thing is to just store the data as you need it, in the form:

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |

Then you don't need to pivot. And BTW, please do not call your original EAV schema "normalized"; it is not normalized.

AlexKuznetsov
A: 

You mention 50 tables in a single query. Whilst SQL Server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan.

If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps which materialise their results into temporary (#) tables. This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable too.
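
For example, the most selective part of a report could be materialised first along these lines (a sketch only; names are illustrative):

    -- Illustrative only: stage the most selective filter into a temp table first.
    SELECT  ia.ItemId
    INTO    #MatchingItems
    FROM    Items_Attributes ia
    JOIN    Attributes a       ON a.Id = ia.AttributeId
    JOIN    AttributeTypes att ON att.Id = a.AttributeTypeId
    WHERE   att.Name = 'Size' AND a.Value = @A;

    CREATE CLUSTERED INDEX IX_MatchingItems ON #MatchingItems (ItemId);

    -- Subsequent steps join the remaining reporting tables to this reduced set:
    SELECT  i.Id, i.Name
    FROM    #MatchingItems m
    JOIN    Items i ON i.Id = m.ItemId;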

Also (a bit of a long shot, this), you don't say which SQL Server version you're on; but if you're on SQL 2005, given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL Server is patched to at least SP2.

I worked on an ETL project using tables with rowcounts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than 5 tables where one or more of the tables was of this scale. This issue was resolved in SP2.

Ed Harper