Some time ago I thought up a new statistics system for our multi-million-user website, to log and report user actions for our customers.

The database design is quite simple: a single table with a foreignId (200,000 distinct ids), a datetime field, an actionId (30 distinct ids), and two more fields containing some meta-information (just smallints). There are no constraints to other tables. Furthermore, we have two indexes, each containing 4 fields, which cannot be dropped, as users get timeouts when we use smaller indexes. The foreignId is the most important field, as each and every query contains it.
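
In rough terms the table looks something like this (the names and the exact index columns are only illustrative here, not our actual definitions):

    -- Hypothetical schema matching the description above; the names and the
    -- index column choices are assumptions, not the real definitions.
    CREATE TABLE dbo.UserActions
    (
        foreignId   int      NOT NULL,  -- ~200,000 distinct values, in every query
        actionDate  datetime NOT NULL,
        actionId    smallint NOT NULL,  -- ~30 distinct values
        metaA       smallint NOT NULL,
        metaB       smallint NOT NULL
    );

    -- The two 4-field indexes described above (exact columns unknown).
    CREATE INDEX IX_UserActions_1 ON dbo.UserActions (foreignId, actionDate, actionId, metaA);
    CREATE INDEX IX_UserActions_2 ON dbo.UserActions (foreignId, actionId, actionDate, metaB);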

We chose to use SQL Server, but after implementation a relational database doesn't seem like a perfect fit: we cannot insert 30 million records a day (it's insert only, we don't do any updates) while also doing a lot of random reads on the database, because the indexes cannot be updated fast enough. Ergo: we have a massive problem :-) We have temporarily solved the problem, yet

a relational database doesn't seem to be suited for this problem!

Would a database like BigTable be a better choice, and why? Or are there other, better choices when dealing with this kind of problem?

NB. At this point we use a single 8-core Xeon system with 4 GB memory and Win 2003 32-bit. RAID10 SCSI as far as I know. The index size is about 1.5x the table size.

+2  A: 

You aren't providing enough information; I'm not certain why you say that a relational database seems like a bad fit, other than the fact that you're experiencing performance problems now. What sort of machine is the RDBMS running on? Given that you have foreign IDs, it seems that a relational database is exactly what's called for here. SQL Server should be able to handle 30 million inserts per day, assuming that it's running on sufficient hardware.

Adam Robinson
We really don't care about any relational integrity. The inserts are fast enough, yet the indexes cannot be updated fast enough.
Jan Jongboom
+7  A: 

It sounds like you may be suffering from two particular problems. The first issue you are hitting is that your indexes need updating every time you perform an insert - are you really trying to run live reports off a transactional server (this is usually considered a no-no)? Secondly, you may also be hitting issues with the server having to resize the database - check that you have allocated enough space and aren't relying on the database to do this for you.
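
On the second point, pre-sizing the files and setting a fixed growth increment avoids autogrow pauses in the middle of heavy inserts. A minimal sketch, assuming a hypothetical database called StatsDb with logical file names StatsDb_Data and StatsDb_Log:

    -- Pre-allocate space so the engine isn't growing files mid-insert;
    -- all names and sizes here are illustrative.
    ALTER DATABASE StatsDb
        MODIFY FILE (NAME = StatsDb_Data, SIZE = 100GB, FILEGROWTH = 4GB);

    ALTER DATABASE StatsDb
        MODIFY FILE (NAME = StatsDb_Log, SIZE = 20GB, FILEGROWTH = 2GB);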

Have you considered looking into something like indexed views in SQL Server? They are a good way to remove the indexing from the main table, and move it into a materialised view.
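
As a rough illustration (the table, view, and column names here are made up, building on the hypothetical schema in the question), an indexed view is an ordinary view created WITH SCHEMABINDING and then materialised by a unique clustered index:

    -- Sketch of an indexed view pre-aggregating actions per foreignId;
    -- names are assumptions, not from the original post.
    CREATE VIEW dbo.vw_ActionCounts
    WITH SCHEMABINDING
    AS
    SELECT foreignId,
           actionId,
           COUNT_BIG(*) AS actionCount  -- COUNT_BIG(*) is required when the view uses GROUP BY
    FROM dbo.UserActions
    GROUP BY foreignId, actionId;
    GO

    -- The unique clustered index is what actually materialises the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_ActionCounts
        ON dbo.vw_ActionCounts (foreignId, actionId);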

Pete OHanlon
+1 I was just typing up something similar.
timdev
Going to test the indexed view. Hadn't thought about that myself.
Jan Jongboom
indexed view = more indexes...
gbn
I found the following on http://www.sql-server-performance.com/tips/indexed_views_p1.aspx: "Avoid creating indexed views on tables that change a lot (many INSERTs, UPDATEs or DELETEs)." Why is this?
Jan Jongboom
Indexed views are even more complex than table indexes!
wqw
A: 

Sybase IQ seems pretty good for this goal, as our architects/DBAs indicated (as in, they explicitly moved all our stats onto IQ, citing that capability as the reason). I cannot substantiate this myself, though - I'm merely nodding at the people in our company who generally know what they are talking about from past experience.

However, I'm wondering whether you MUST store all 30 million records. Would it not be better to store some pre-aggregated data?

DVK
Well, at this point we use a staging table, aggregate the data at night, and bulk insert it into the main table (removing the indexes first and adding them back afterwards). But we'd like to get a more real-time view of the actions on the site.
Jan Jongboom
+3  A: 

You could try making the table a partitioned one. This way the index updates will affect smaller sets of rows. Probably daily partitioning will be sufficient. If not, try partitioning by the hour!
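
A rough sketch of daily partitioning, reusing the hypothetical table and column names from the question (the boundary dates and object names are placeholders):

    -- Partition function and scheme; boundary values and names are illustrative.
    CREATE PARTITION FUNCTION pf_ActionsByDay (datetime)
        AS RANGE RIGHT FOR VALUES ('2009-07-20', '2009-07-21', '2009-07-22');
    GO

    CREATE PARTITION SCHEME ps_ActionsByDay
        AS PARTITION pf_ActionsByDay ALL TO ([PRIMARY]);
    GO

    -- Creating the clustered index on the scheme is what partitions the table.
    CREATE CLUSTERED INDEX IX_UserActions_Date
        ON dbo.UserActions (actionDate)
        ON ps_ActionsByDay (actionDate);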

wqw
+2  A: 

Replicating the database for reporting seems like the best route, given heavy traffic. However, a couple of things to try first...

Go with a single index, not two indexes. A clustered index is probably going to be a better choice than a non-clustered one. Fewer, wider indexes will generally perform better than more, narrower indexes. And, as you say, it's the indexing that's killing your app.

You don't say what you're using for IDs, but if you're using GUIDs, you might want to change your keys over to bigints. Because GUIDs are random, they put a heavy burden on indexes, both in building them and in using them. A bigint identity column will keep the index more or less chronological, and if you're really interested in real-time access for queries on your recent data, your access pattern is much better suited to monotonically increasing keys.
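
For illustration only (this is a variant of the hypothetical table sketched in the question, not the poster's actual schema), an ever-increasing bigint IDENTITY clustered key keeps inserts appending at the end of the index instead of splitting random pages the way a GUID key tends to:

    -- Illustrative: a monotonically increasing surrogate key as the clustered index.
    CREATE TABLE dbo.UserActions
    (
        actionLogId bigint   IDENTITY(1,1) NOT NULL,
        foreignId   int      NOT NULL,
        actionDate  datetime NOT NULL,
        actionId    smallint NOT NULL,
        metaA       smallint NOT NULL,
        metaB       smallint NOT NULL,
        CONSTRAINT PK_UserActions PRIMARY KEY CLUSTERED (actionLogId)
    );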

Cylon Cat
yes that definitely sounds like the best approach; one copy of the database with no indices at all for all the inserts, and a replicated copy with the indices for the reporting. That way, the indices don't get in the way of the updates happening
marc_s
A: 

Not sure about SQL Server, but in another database system I used long ago, the ideal method for this type of activity was to store the updates and then, as a batch, turn off the indexes, add the new records, and reindex. We did this once per night. I'm not sure whether your reporting needs would fit this type of solution, or even whether it can be done in MS SQL, but I'd think it could.
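
In SQL Server this pattern could be expressed with ALTER INDEX (available since SQL Server 2005). A sketch, using the hypothetical table and index names from earlier plus a made-up staging table:

    -- Disable the non-clustered indexes, bulk load, then rebuild everything.
    ALTER INDEX IX_UserActions_1 ON dbo.UserActions DISABLE;
    ALTER INDEX IX_UserActions_2 ON dbo.UserActions DISABLE;

    INSERT INTO dbo.UserActions (foreignId, actionDate, actionId, metaA, metaB)
    SELECT foreignId, actionDate, actionId, metaA, metaB
    FROM   staging.UserActions;  -- hypothetical staging table

    -- REBUILD re-enables the disabled indexes in one pass.
    ALTER INDEX ALL ON dbo.UserActions REBUILD;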

Deverill
We are doing this as a workaround since we faced serious problems, yet I'd prefer a more real-time approach.
Jan Jongboom
+8  A: 

You say that your system is capable of inserting 3,000 records per second without indexes, but only about 100 with the two additional non-clustered indexes. If 3k/s is the maximum throughput your I/O permits, adding two indexes should in theory reduce the throughput to about 1,000-1,500/sec. Instead you see a degradation ten times worse than that. The proper answer is 'It Depends', and some serious troubleshooting and bottleneck identification would have to be carried out. With that in mind, if I were to venture a guess, I'd give two possible culprits:

A. The additional non-clustered indexes distribute the writes of dirty pages into more allocation areas. The solution would be to place the clustered index and each non-clustered index into its own filegroup, and place the three filegroups on separate LUNs on the RAID.

B. The low selectivity of the non-clustered indexes creates high contention between reads and writes (key conflicts as well as %lockres% conflicts), resulting in long lock wait times for both inserts and selects. Possible solutions would be using SNAPSHOT isolation or read committed snapshot mode, but I must warn about the danger of adding a lot of IO in the version store (i.e. in tempdb) on a system that may already be under high IO stress. A second solution is using database snapshots for reporting; they cause lower IO stress and can be better controlled (no tempdb version store is involved), but the reporting is no longer on real-time data.
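
For reference, the two options would look roughly like this (the database name and the snapshot file path are hypothetical):

    -- Option 1: readers use row versions instead of blocking on writers
    -- (at the cost of version-store IO in tempdb).
    ALTER DATABASE StatsDb SET READ_COMMITTED_SNAPSHOT ON;

    -- Option 2: a database snapshot for reporting; point-in-time, not real-time.
    CREATE DATABASE StatsDb_Reporting ON
        (NAME = StatsDb_Data, FILENAME = 'E:\Snapshots\StatsDb_Reporting.ss')
    AS SNAPSHOT OF StatsDb;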

I tend to believe B) is the likely cause, but I must again stress the need for proper investigation and proper root cause analysis.

'RAID10' is not a very precise description.

  • How many spindles in the RAID 0 part? Are they short-striped?
  • How many LUNs?
  • Where is the database log located?
  • Where is the database located?
  • How many partitions?
  • Where is tempdb located?

As to the question of whether relational databases are appropriate for something like this: yes, absolutely. There are many more factors to consider - recoverability, availability, toolset ecosystem, available know-how, ease of development, ease of deployment, ease of management, and so on. Relational databases can easily handle your workload; they just need the proper tuning. 30 million inserts a day is about 350 per second, which is small change for a database server. But a 32-bit system with 4 GB of RAM is hardly a database server, regardless of the number of CPUs.

Remus Rusanu
A: 

You don't say how the inserts are managed. Are they batched or is each statistic written separately? Because inserting one thousand rows in a single operation would probably be way more efficient than inserting a single row in one thousand separate operations. You could still insert frequently enough to offer more-or-less real time reporting ;)
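
As a sketch of the difference - assuming SQL Server 2008's multi-row VALUES syntax (on 2005 you would use INSERT ... SELECT with UNION ALL, or a staging table), and reusing the hypothetical column names from earlier - a single batched statement replaces many single-row round trips:

    -- One statement and one round trip for the whole batch, instead of one per row.
    INSERT INTO dbo.UserActions (foreignId, actionDate, actionId, metaA, metaB)
    VALUES (1001, '2009-07-21T10:00:00', 3, 0, 0),
           (1002, '2009-07-21T10:00:01', 7, 1, 0),
           (1003, '2009-07-21T10:00:02', 3, 0, 2);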

APC