Database schema for large web apps

views:

452

answers:

+1 Q:

Database schema for large web apps

Many of the SaaS web app services have a company based concept. So, each company using the service has its own set of users, files, and other data. How do the web apps usually handle this on the DB side? Do they create a new database for every company (containing the data tables related to that company) ? Or do they have some kind of company_id relation to select the relevant data from a single DB ?

+2 A:

Definitely the company_id.

Creating a new table for every anything - let alone a new DATABASE - would be absurd (and indeed is the fodder for many a Daily WTF post).

That's the whole point of using a relational database - you link things together.

If you need to then make the db bigger, there are tons of ways to do it (master/slave, master/multislave, dual master, horizontal scaling, just buying a ton of RAM, etc).

FWIW: my last app had ~12 million users (~300k per day); it had two databases (horizontal scaling; done by the previous guys. I didn't agree with that decision, and would've just used slaves).

EDIT: Caveat - this is assuming that you are only exposing access via your app (either its web interface or an API).

If you really need to expose the database directly to customers, a) tell them to think again because it's a bad idea, and b) then you may need to make hard choices between what's easier to maintain that preserves the needed firewalling. But srsly, you don't want to go there if you can help it.

Sai Emrys 2009-05-10 15:19:30

No, I am not looking at exposing the DB directly. My concern was just the performance in the longer run. And yeah, it now looks silly to be replicating DBs - any changes in schema, backups would be horrendous.

2009-05-10 16:26:30

This completely depends on how much data per company we are talking about. In a general case, company_id may work well, although if you are, say, Salesforce, there isn't a master/slave architecture that can hold you, and you need horizontal scaling anyway (although not at the rate of 1 DB per company). If you are Facebook, you create a new MySQL instance per university, and that lets you scale from the get-go, although it introduces cross-db query headaches.

SquareCog 2009-05-10 16:54:01

Actually I am not sure about Salesforce. Given how slow it is, it just might be running on a single giant Oracle RAC instance, and trying to burn through everyone's data when generating your reports :-).

SquareCog 2009-05-10 16:56:10

SquareCog - When he's talking billions of rows, that's the time to worry about making major sacrifices to the design to support massive scale. But he's not, and most likely never will be. There are very very few Facebooks.

Sai Emrys 2009-05-11 03:27:36

+1 A:

I recently found out there is a name for this sort of thing in SaaS, when an application and database is shared between companies:

Multitenancy

codeulike 2009-05-10 15:38:41

+4 A:

We had exactly this issue a few months ago in a product that is often used by organizations that, in turn, serve multiple clients. They came to us asking us to modify our SaaS system so they could create complete, discrete web sites for each of their clients (we build an online, domain-specific web-site construction tool).

A short summary: it may seem obvious to put everyone on a single database but, as you probe deeper, you'll find it isn't always cut-and-dry. There are a few challenges that you'll want to keep in mind as you proceed. A couple of points:

First, it isn't enough to just add "Company_id" to a few tables. Indeed, despite Sai's comments about it being ridiculous to have a database/app for each company there are absolutely cases where this makes sense due to the underlying complexity of hosting SaaS systems for multiple, discrete clients. If you are simply serving a few different companies (e.g. creating invoices for them) then Sai's comment is quite true. If you are providing a software application to multiple organizations, however, the complexity is quite a bit higher and discrete databases may well be in order.

Second, be prepared for a significantly more complex user querying and reporting effort in a multi-client database. For example, when building our user-querying capabilities we had to be absolutely certain that there would be no "bleed-through" between organizations as there was HIPAA-protected data involved. This meant that the querying and reporting capabilities required a level of engineering far in excess of what had gone before. In our case, our querying capabilities were very flexible and essentially permitted users to construct queries on the fly (subject to some pretty stiff constraints, obviously - we weren't accepting SQL!). Thus, we had to make sure that every query was automatically modified to use the "Company_ID" constraint, as appropriate, no matter what the origin of the data or the permissions of the staff member submitting the query. The wrinkle? Our 'super-user' analysis account had to be able to run the queries without such a constraint...

Third, you probably do not yet anticipate just how many things need to be separated. For example, I had built a quite sophisticated "Settings" object into the site that pulled settings from the database on startup and maintained them in the "Application" object (this is a .NET app). This all needed to be floated to handle multiple organizations.

For another example, fields that used to be unique for us (e.g. logins) now had to be done as part of a Company_ID, LoginID key. If you are building from scratch, this isn't such a big idea, but we were retrofitting so it was.

Anyway, as I proceeded through the build, I was surprised to find out just how much work was required to do this right.

Fourth, I always build software using a "meta-programming" approach. That is, I rarely build a single-purpose page but rather often build a highly customizable framework in order to facilitate end-user customization and internal code reuse. While I anticipated that this would help with a transition to multi-organization databases, it often did not! Because such coding is often fairly complex to begin with, floating the Organization was often more difficult than if I simply had a vanilla web page.

Finally, if there is no crying need to share data (e.g. analysis of overall usage patterns) then you might want to stick with discrete databases simply to facilitate scaling. While you add new multi-org databases (a second discrete system), our scaling often involved existing clients that suddenly experienced a surge in growth. Peeling them out of an existing database and onto a new server is a bit more difficult than just moving to a new server with an existing database.

With all of these caveats, you might think I'd advise you against building a system capable of handling multiple organizations on a single database. However, this isn't the case: there are some real wins taking a multi-org approach! Usage analysis, cross-organizational reporting, application deployment, etc. are all significantly enhanced. I just want to provide you the benefit of our experience in the hopes that it'll help you anticipate some of the difficulties that you may anticipate.

Mark Brittingham 2009-05-10 16:43:33

Excellent answer, wish I could upvote a few times. This is the part where I start waving my horizontal scaling, share-nothing architecture, AsterData, Greenplum, Vertica, Netezza flag.

SquareCog 2009-05-10 16:57:48

Lol - thanks SquareCog - I appreciate the comment.

Mark Brittingham 2009-05-10 17:08:07

If you're providing an *application*, then yes, it's a different thing - then you're talking just providing code that they run, and therefore they have their own DBs.However, let's face it: it's extremely unlikely he's anywhere near that size. When he gets there, then he can deal with it. In the meantime, trying to do this kind of horizontal scaling will only make things more complicated.

Sai Emrys 2009-05-11 03:24:35

+1 A:

When running software as a service there's always a few things to consider when choosing a database strategy. Two arguments for a separate database per client are backups and (feeling of) security. If you have one database with a discreet customer_id field and customer 666 screws up and wants his data of yesterday restored, you're in for some work.

Have a single database per customer is also sometimes required by that customer because the data might be sensitive. He could rightfully argue that is it more save to put the data in different databases and setup good security.

-Edoode

edosoft 2009-05-10 17:56:48

ansaurus

tags:

views:

answers:

Database schema for large web apps

related questions