views:

77

answers:

4

We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.

Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we-use in-house which we'll use to do the "heavier" reporting as the requirements solidify.

My question is: how do you know when a separate database for reporting is required?

What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?

+1  A: 

The main reason you would need a separate database for reporting issues is when the generation of the reports interferes with the transactional responsibilities of the app. E.g. if a report takes 20 minutes to generate and utilizes 100% of the CPU/Disk/etc... during a time of high activity you might think of using a separate database for reporting.

As for questions, here are some basic one:

  1. Can I do the high intensity reports during non-peak hours?
  2. Does it interfere with the users using the system?
  3. If yes to #2, what are the costs of the interference Vs the cost of another database server, refactoring code, etc...?
Corith Malin
databases are designed to negate this issue. A properly set up database and reporting system should not have an issue.
northpole
Not all databases are perfect, and sometimes taking the approach of throwing more hardware at a problem is much cheaper than hiring out developers, DBAs, project managers, etc... to recode the application for a more optimized database design. Your comment is similar to saying that jet engines negate all the issues of prop planes. While that is true, a prop plane is still a whole lot cheaper to build, fly, and maintain (also doesn't require as high a level of pilot).
Corith Malin
+2  A: 

Basically, when the database load from the app becomes incompatible with the database load for reporting. This could be due to:

  • Reporting consuming inordinate amount of database server resources impacting the app's DB performance.

    A part of this category would be the app DB work having to wait on a majorly slow report query due to locking, though it might be possible to resolve with less drastic methods like locking tuning.

  • Reporting queries being very incompatible with app queries as far as tuning (e.g. indices but not limited to that) - the most dumb example would be something like a hot spot affecting app inserts because of the reporting-purpose index.

  • Timing issues. E.g. the only small windows for DB maintenance available (due to application usage) are the times of heavy reporting work

  • Reporting data's sheer volume (e.g. logging, auditing, statistics) is so big that your primary DB server architecture is a bad solution for such reporting (see Sybase ASE vs. Sybase IQ). BTW, this is a real scenario - we moved our performance reporting to IQ because of this.

DVK
Properly indexed tables and tuned queries would hopefully make this a non issue. I am not convinced performance would require separately maintained databases.
northpole
@northpole - sometimes you just can't tune any more. If a query at max tuning takes 1 hour to run, you can't wish that away. The above answer kinda assumes you already tuned as best as it could be.
DVK
@DVK, and you next solution would be to create and maintain a separate but equal database? Why not consider new hardware, or a new Database all together, like an Oracle cluster? You suggest timing is your main reason, but how can you expect good performance when you have to do real time replication in order to have current data in both.
northpole
@DVK, I don't buy that the correct assumption is that you already tuned the database as best as possible. I work on projects with hundreds of millions of rows with real time reporting along with hundreds of users accessing the application/database at the same time with out issue.
northpole
@northpole - if your first guess is "separate but equal", you are putting words into my mouth that weren't there. And I'm sure you are better DB tuner than a bunch of fairly good developers and a set of high-level DBAs for one of the major and very technology-centric financial firms out there, especially on a DB and a codebase and a problem you never saw in your life.
DVK
@DVK, I was not trying to put words in your mouth, but I had to assume based off the info you provided. If you were trying to explain something like Cade Roux, then my apologies for generalizing your answer.
northpole
@northpole - one advantage of a new database over new hardware or a cluster is cost. Yes the additional complexity (cost) of a second DB has to be taken into account but I doubt the costs are in the same ball-park. Also, depending on the organisation, operational costs associated with managing a new db come out of a different budget from the captial costs associated with new hardware - so there's also a political / fiscal dimension to the problem.
Adrian K
+1  A: 

In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.

  1. When transaction performance is critical.
  2. When it's hard to get a maintenance window on the transactional app.
  3. If reporting needs to correlate results not only from this app, but from other application silos.
  4. If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
  5. If the reports are long running.
  6. If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
  7. If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations).

It adds non-trivial complexity, so imo, there has to be a good reason to split.

Rob
Thanks - I like your concise list of points to consider.
Adrian K
+3  A: 

Typically, I would try to report off the transactional database initially.

Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.

When you do go to a reporting database, remember there are only a few reasons you are going there:

Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So if your reporting database is a straight copy of the same database, you're simply using delayed snapshots which won't interfere with production transactions.

Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.

Now both the above could be done on the same server (even the same instance in a separate database or even just in a separate schema) and still see benefits. When CPU and IO are completely pegged, at that point, you definitely need to have it on a completely separate box (or upgrade your single box).

Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database is the same data in a different model. Reporting of large amounts of data (particularly aggregates) is extremely fast in dimensional models because the star schemas are very efficient for that. It also is efficient for a larger variety of queries without a lot of re-indexing or analysis to change indexes, because the dimensional model lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request). You could view this is a kind of mini-data warehouse where you use data warehousing techniques, but aren't necessarily implementing a full-blown data warehouse. Also, star schemas are particular easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI tools or reporting tools from star schemas. You could do this on the same box or different box etc, just like discussed earlier.

Cade Roux
+1 This is the only way it should be considered/done.
northpole
Thanks - I like your reasoning and explainations.
Adrian K