views: 431

answers: 9
A lot of the LOB applications we provide to our customers are of a marketing / promotional nature (sweepstakes, event registration, etc.). Most of the applications, while very simple, are very demanding on the database. Imagine a "registration"-type site as the backing for a commercial that airs during the Super Bowl, for example (yes, we've had several).

Though we have gotten very good at optimizing our web app code, the database always remains an issue, despite the application being relatively simple. The flow is typically something like:

  1. Read from database to detect existing record
  2. Write to database if record is new

In many cases, this is all the data access our application needs to perform. However, given that it is the sole purpose of the application, it's quite important that this simple process be optimized greatly.
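
For concreteness, here is a minimal sketch of that two-step flow, using Python's sqlite3 as a stand-in for the real database and a hypothetical registrations table keyed by email:

    import sqlite3

    # Hypothetical schema: one registrations table keyed by email.
    conn = sqlite3.connect("registrations.db")
    conn.execute("CREATE TABLE IF NOT EXISTS registrations (email TEXT PRIMARY KEY, name TEXT)")

    def register(email, name):
        # 1. Read from the database to detect an existing record.
        exists = conn.execute(
            "SELECT 1 FROM registrations WHERE email = ?", (email,)).fetchone()
        if exists:
            return False
        # 2. Write to the database only if the record is new.
        conn.execute("INSERT INTO registrations (email, name) VALUES (?, ?)", (email, name))
        conn.commit()
        return True

With a unique key on the lookup column, the two steps can often be collapsed into a single guarded insert (INSERT OR IGNORE in SQLite, INSERT IGNORE in MySQL), which halves the round trips for the common case.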

For the purposes of this question, we have a single server running a RAID 5 disk array for the data files and another RAID 5 array for the logs. At this time, the OS is Windows 2003 Standard 32-bit and the server has 4 GB of memory. Some apps use SQL Server 2005 Standard while others use MySQL 5.1. I'm very well aware that certain OS and hardware optimizations are possible here, but I'm looking to address my needs from the software side first. Extensive profiling has taught us that disk IO is generally the main bottleneck.

Having said all that, and knowing that caching won't help much since most reads are unique and return very little data (often only a bit indicating whether a record exists or not), I'm considering making a leap into the realm of in-memory databases as a sort of write-cache layer in front of the real database. This seems like a good fit given that most of our high-volume traffic is sporadic in nature, not sustained over several hours. Additionally, the potential loss of a few minutes of data due to a server crash would be acceptable in most cases.

In the simplest form, I would modify a typical registration app to do the following:

  1. Query the disk DB and memory DB for existing records
  2. If none, write data to the memory DB and return
  3. Periodically flush memory DB to disk DB

My question is: what are my options for this intermediate in-memory database? I've experimented with in-memory hash tables, data tables, and such, but I'm looking for other options or even suggestions for a completely different approach.
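
As a minimal sketch of that flow, with plain Python structures standing in for the memory DB, a hypothetical registrations table, and a flush meant to be called from a timer (accepting the stated risk of losing a few minutes of buffered data on a crash):

    import sqlite3
    import threading

    disk = sqlite3.connect("registrations.db", check_same_thread=False)
    disk.execute("CREATE TABLE IF NOT EXISTS registrations (email TEXT PRIMARY KEY, name TEXT)")

    pending = {}              # in-memory write buffer, keyed by email
    lock = threading.Lock()

    def register(email, name):
        # 1. Query both the memory buffer and the disk DB for an existing record.
        with lock:
            if email in pending:
                return False
        if disk.execute("SELECT 1 FROM registrations WHERE email = ?", (email,)).fetchone():
            return False
        # 2. New record: write it to the memory buffer and return immediately.
        with lock:
            pending[email] = name
        return True

    def flush():
        # 3. Periodically move the buffered records to the disk DB in one batch.
        with lock:
            batch = list(pending.items())
            pending.clear()
        disk.executemany(
            "INSERT OR IGNORE INTO registrations (email, name) VALUES (?, ?)", batch)
        disk.commit()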

+1  A: 

SQLite has an in-memory operation mode. This would work if you have a persistent server process behind your page-hit handler.

Otherwise, regular file-based DBs can be fooled into writing their files to a memory file system such as tmpfs.
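
For illustration, the in-memory mode is just a special connection string; a minimal Python sketch (table name hypothetical):

    import sqlite3

    # The database lives entirely in RAM and disappears when this connection
    # closes, so it only helps if a persistent process owns the connection.
    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE registrations (email TEXT PRIMARY KEY, name TEXT)")
    mem.execute("INSERT INTO registrations VALUES (?, ?)", ("a@example.com", "Alice"))
    print(mem.execute("SELECT COUNT(*) FROM registrations").fetchone()[0])  # prints 1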

Ewan Todd
+2  A: 

If you don't need to know in real time whether there is an existing record (i.e. it's important that the record gets in there, but you don't need to tell the user whether it was new or existing), you can structure your database in a way that allows for extremely fast write times without the need for an in-memory database, which carries a lot of potential problems if servers go down or worker processes restart.

Create two tables in your database for each table that is involved in this write-heavy flow. One table should be your "live" table and should be as write-optimized as possible (i.e. no indexes, and it is never read from except when moving rows to the read table). Your other table should be your read-optimized table, indexed as appropriate for any reporting considerations, etc.

Whenever you're writing into your live table, ignore whether a record is new or existing; just get the data into the table as fast as possible and get out of the DB. Set up a scheduled job that moves records from the live table into the read-optimized table, and worry about matching up existing records there. Ideally this would be done during non-peak times, but otherwise you might want to consider a third staging table so that there's no contention on the live table at any time.
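
A minimal sketch of that split, using sqlite3 and hypothetical table names; on SQL Server or MySQL the move step would be the equivalent INSERT ... SELECT with whatever duplicate handling you need:

    import sqlite3

    conn = sqlite3.connect("registrations.db")

    # Write-optimized "live" table: no indexes, no constraints, append only.
    conn.execute("CREATE TABLE IF NOT EXISTS registrations_live (email TEXT, name TEXT)")

    # Read-optimized table: keyed and indexed for reporting.
    conn.execute("CREATE TABLE IF NOT EXISTS registrations_read (email TEXT PRIMARY KEY, name TEXT)")

    def capture(email, name):
        # Hot path: blind insert, no existence check, get out of the DB fast.
        conn.execute("INSERT INTO registrations_live VALUES (?, ?)", (email, name))
        conn.commit()

    def move_to_read_table():
        # Scheduled job: de-duplicate while moving, then clear the live table.
        # (In practice you would batch this, or move via a staging table to
        # avoid contention on the live table.)
        conn.execute("""INSERT OR IGNORE INTO registrations_read (email, name)
                        SELECT email, name FROM registrations_live""")
        conn.execute("DELETE FROM registrations_live")
        conn.commit()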

Ryan Brunner
A: 

I don't know about the databases you mention, but if the content of the database (or at least the important table) fits into memory, Oracle is able to pin it in the cache, so it basically behaves like an in-memory database.

I'd also check the isolation level settings of your database. If you are able to relax those, you might be able to reduce locking.

Finally, consider removing unique constraints, or disabling them for the peak times.
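
As a rough illustration of those three knobs on Oracle (the driver, credentials, table and constraint names are all hypothetical, and the exact statements should be checked against your version's documentation):

    import cx_Oracle  # assumes the cx_Oracle driver and an Oracle instance are available

    conn = cx_Oracle.connect("app/secret@dbhost/orcl")  # hypothetical credentials
    cur = conn.cursor()

    # Keep the hot table resident in the KEEP buffer pool so, once warmed,
    # it behaves much like an in-memory table.
    cur.execute("ALTER TABLE registrations STORAGE (BUFFER_POOL KEEP)")

    # Relax the isolation level for this session to reduce locking.
    cur.execute("ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED")

    # Disable a unique constraint for the peak window; re-enable and
    # clean up duplicates afterwards.
    cur.execute("ALTER TABLE registrations DISABLE CONSTRAINT uq_registrations_email")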

Jens Schauder
+1  A: 

In my view, you should be able to accommodate your workload with an RDBMS that has a user-sizeable cache. I'm seeing on the order of 10,000 indexed records per second with a simple C++-callable RDBMS on ordinary hardware, including the commit to disk. Further, since you may be looking at just one small field in a record, look for a column-oriented database, one that stores data down the column: there's no point in reading an entire row if you're only interested in one field.

John
+1  A: 

Optimising your database schema for writes rather than reads, as mentioned by many others, is your first port of call, although I guess you've been there already.

Before investigating in-memory databases, you may want to have a look at some of the ORMs that are available, particularly NHibernate.

NHibernate keeps some data in memory and gives you some control over when data updates are 'flushed' from memory and synchronised with the database.

You might find it worth a look.

Mike
+2  A: 

Not programming-related, but it would definitely help: get some of the newer solid-state disks.

Yes, they are expensive for the size, but since disk IO is the bottleneck, just swapping out the current HDDs for some SSDs would greatly improve performance.

Neil N
+1  A: 

Edit: Concentrating strictly on the disk I/O...

  1. Rip out as many unnecessary indices as you can. Indices don't come for free; they cost both space and time.
  2. Rip out any special triggers or constraints that you don't need.
  3. Rip out any entity relationships/relational integrity operators that aren't absolutely critical.
  4. If your current DBMS supports it, separate out the transaction tables into multiple disks (e.g., round-robin).
  5. Consider adding more database servers, independent of each other (i.e., no replication involved); to do this, you need a scheduler to decide which server will accept each transaction, plus a scheme/separate process that consolidates the transactions (see the sketch below).

Minimizing the amount of database logic and adding servers laterally (as opposed to buying bleeding-edge server technology) is basically the approach taken by eBay.
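
A toy sketch of point 5, with sqlite files standing in for independent database servers and hypothetical table names:

    import itertools
    import sqlite3

    # Stand-ins for N independent servers; a real deployment would point these
    # connections at separate SQL Server / MySQL instances.
    servers = [sqlite3.connect("capture_%d.db" % i) for i in range(3)]
    for s in servers:
        s.execute("CREATE TABLE IF NOT EXISTS registrations_live (email TEXT, name TEXT)")

    # Scheduler: round-robin writes across the servers.
    next_server = itertools.cycle(servers)

    def capture(email, name):
        s = next(next_server)
        s.execute("INSERT INTO registrations_live VALUES (?, ?)", (email, name))
        s.commit()

    def consolidate(central):
        # Separate consolidation process: pull every server's rows into one
        # central, indexed registrations table, then clear the capture tables.
        for s in servers:
            rows = s.execute("SELECT email, name FROM registrations_live").fetchall()
            central.executemany(
                "INSERT OR IGNORE INTO registrations (email, name) VALUES (?, ?)", rows)
            s.execute("DELETE FROM registrations_live")
            s.commit()
        central.commit()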

hythlodayr
+2  A: 

Embrace the new notion "Everything is a message, the database is the backup". When you have something to store, create a message and send it to a black box (like eJabberD) using XMPP. Let the black box update your database on its own schedule. That is how sites like Twitter work.

Take a look at this slideshow: http://www.slideshare.net/kellan/beyond-rest
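
As a rough illustration of the pattern (not of XMPP itself), here is a sketch with an in-process queue standing in for the message broker, and hypothetical table and handler names:

    import queue
    import sqlite3
    import threading

    messages = queue.Queue()   # stand-in for the eJabberD / XMPP black box

    def handle_registration(email, name):
        # The web request only creates a message; it never touches the database.
        messages.put((email, name))

    def writer():
        # The consumer drains messages and updates the database on its own schedule.
        db = sqlite3.connect("registrations.db")
        db.execute("CREATE TABLE IF NOT EXISTS registrations (email TEXT PRIMARY KEY, name TEXT)")
        while True:
            email, name = messages.get()
            db.execute("INSERT OR IGNORE INTO registrations VALUES (?, ?)", (email, name))
            db.commit()

    threading.Thread(target=writer, daemon=True).start()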

srini.venigalla
+1  A: 

Here's a strange idea: don't use a database for the initial capture. Design two or three screamingly fast indexed files, whose format doesn't need to change very often. Capture the data in those files.

Write some appropriately triggered software that copies the captured data into a database without delaying the interactive user. Mark copied data to prevent duplicate copies and to recycle space in the file.

Now you can design the database with the idea of sharing data between multiple uses, rather than with the idea of keeping up with the capture process. After all, sharing data is where databases really shine.
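
A toy version of that capture-then-copy idea, with hypothetical file names and a byte-offset file taking the place of "mark copied data":

    import os
    import sqlite3

    CAPTURE_FILE = "capture.log"     # append-only capture file (hot path)
    OFFSET_FILE = "capture.offset"   # how many bytes have already been copied

    def capture(email, name):
        # Interactive path: a single append to a local file, no database involved.
        with open(CAPTURE_FILE, "a") as f:
            f.write("%s\t%s\n" % (email, name))

    def copy_to_database(db):
        # Triggered job: copy anything new into the database, then advance the
        # offset so the same records are never copied twice.
        start = int(open(OFFSET_FILE).read()) if os.path.exists(OFFSET_FILE) else 0
        with open(CAPTURE_FILE, "rb") as f:
            f.seek(start)
            data = f.read()
        for line in data.decode().splitlines():
            email, name = line.split("\t")
            db.execute("INSERT OR IGNORE INTO registrations VALUES (?, ?)", (email, name))
        db.commit()
        with open(OFFSET_FILE, "w") as o:
            o.write(str(start + len(data)))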

Walter Mitty
This is the basic idea behind how Google's Dapper distributed tracing tool works. Apps write to local files, and collectors then lazily copy those to BigTable.
fumanchu