I have an interesting problem which I've been looking into and would appreciate some advice:

I'm trying to create a tool which mimics the basic capabilities of a requirements management tool as part of a company project.

The basic design is a Windows Explorer-like layout of folders and documents. Documents can be opened in a GUI, edited, and saved.

The document itself contains a hierarchical spreadsheet (think of Excel with Chapters, if that makes any sense). Each chapter contains rows, which are really just some requirements text + some other values which complement it. When displayed, the requirement text and attribute values show up as independent columns (much like Excel), with filtering capabilities.
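
To make that concrete, here is a rough sketch of the structure I have in mind (the class and field names are just placeholders, not the actual design):

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class RequirementRow:
        req_text: str                                              # the requirement text itself
        attributes: Dict[str, str] = field(default_factory=dict)   # shown as extra columns

    @dataclass
    class Chapter:
        title: str
        rows: List[RequirementRow] = field(default_factory=list)

    @dataclass
    class Document:
        name: str
        chapters: List[Chapter] = field(default_factory=list)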

Representing the user/permissions/folder hierarchy/etc. for this type of program is pretty straightforward, but where I get hung up is on the document content itself...

My biggest concern is size and how it relates to performance: as part of this tool, I intend to store not only the current state of each document, but also the entire list of changes made since day 1 (much like SVN), and then provide fast access to that history of changes.

On average, I expect ~500 documents in the repo; each document will probably have ~20,000 active rows; and over the course of a year, it's not unreasonable to assume ~20,000 edits per document (meaning each document will acquire an additional 20,000 rows year-in and year-out).

Multiplied by the number of documents, that amounts to nearly 10,000,000 rows (with an additional 10,000,000 the next year, and the next year, and so on). Old histories can be purged, but it would only be performed by an admin (and it's not preferable that he/she do so).

As I see it, there are two ways for me to handle this situation:

  • I can try to represent all rows of all documents in a single table (much like how phpBB stores all posts of all forums in a single table), or...

  • I can try to store the rows of each document in a uniquely named table (meaning each document has its own table); the table would have to be given a unique name, and a master table would contain the list of all documents and the table names that correspond to each (both options are sketched below).
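
Very roughly, I picture the two alternatives like this (a sketch only; table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Option 1: all rows of all documents in a single shared table.
    conn.execute("""
        CREATE TABLE doc_rows (
            row_id      INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            chapter     TEXT,
            req_text    TEXT,
            attr_values TEXT
        )""")

    # Option 2: a master table listing documents, plus one uniquely named
    # table per document, created on the fly.
    conn.execute("""
        CREATE TABLE documents (
            document_id INTEGER PRIMARY KEY,
            table_name  TEXT NOT NULL
        )""")

    def create_document_table(conn, doc_id):
        table_name = f"doc_rows_{doc_id}"
        conn.execute(f"""
            CREATE TABLE {table_name} (
                row_id      INTEGER PRIMARY KEY,
                chapter     TEXT,
                req_text    TEXT,
                attr_values TEXT
            )""")
        conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, table_name))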

So my question: which is really preferable? Is neither a good option? Can anyone offer advice on which approach you would find more appropriate, given these needs?

A: 

There is nothing wrong with having many tables. It seems having many tables will be a more reasonable approach for you.

CookieOfFortune
-1. Please see Aleris' answer for some of the reasons why a multi-table approach is usually the wrong answer.
j_random_hacker
+2  A: 

A few points to consider with the multiple-table approach:

  • Will it be necessary to look up information across all the documents? If yes, you will need to search all the tables, which is not so simple to achieve.
  • If the schema changes, it is not simple to update the database, as all the tables representing the same type of entity would need to change.
  • Tracking information about user edits is also not very simple, as the information is split across multiple tables (e.g. consider the scenario 'which documents did the user modify'; see the sketch after this list).
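
To illustrate that last point with a rough sketch (the modified_by column and both schemas here are assumptions, not taken from your description): with a single shared table the question is one query, while with one table per document you have to loop over every table listed in the master table:

    def docs_modified_by_single_table(conn, user):
        # One shared doc_rows table: a single query answers the question.
        return [row[0] for row in conn.execute(
            "SELECT DISTINCT document_id FROM doc_rows WHERE modified_by = ?",
            (user,))]

    def docs_modified_by_per_document_tables(conn, user):
        # One table per document: visit every table named in the master table.
        found = []
        for doc_id, table_name in conn.execute(
                "SELECT document_id, table_name FROM documents"):
            if conn.execute(
                    f"SELECT 1 FROM {table_name} WHERE modified_by = ? LIMIT 1",
                    (user,)).fetchone():
                found.append(doc_id)
        return found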

Have you considered alternative approaches to storing the data? Is it necessary to store each Excel-like row as a row in the database? Could you store the data as XML and only save indexes in the database? Or maybe store only the tracked modifications and the document versions? Could the application take part of the database burden and do the filtering itself?

Aleris
+1. *Everything* gets harder when you use multiple dynamically-generated tables. Also consider that DBs are designed to handle a small number of very large tables efficiently -- they may or may not handle a large number of smaller tables well.
j_random_hacker
A: 

You may want to consider some sort of document management system. This sounds like something that SharePoint could do - it can be set to create a new version of a document when the document is checked in. Documents may also have meta-data assigned to them, and this can be required.

John Saunders
+3  A: 

If you are creating and/or destroying tables programmatically during the normal day-to-day operation of your application, I would say this is a very bad sign that something in the database design is wrong.

Database systems can and do handle tables with that many rows. To make any meaningful sorts of queries on that number of rows, you really do have to choose your indexes carefully and frugally. I mean, you really have to know intimately how the table will be queried.

However, I dare say it would be a good deal less complicated to implement than the approach you proposed of creating new tables arbitrarily based on IDs or numbers alone. And, with less complication comes greater ease of maintenance, and less chance that you'll introduce nasty bugs that are hard to debug.
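
As a rough illustration of what choosing indexes carefully might look like for the single-table approach (the table, columns and query pattern here are assumptions, not a prescription):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE doc_rows (
            row_id      INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            revision    INTEGER NOT NULL,  -- which edit produced this row
            chapter     TEXT,
            req_text    TEXT
        )""")

    # If the dominant query is "open document X at (or up to) revision Y",
    # a composite index on (document_id, revision) serves it even with tens
    # of millions of rows. Index only what the queries actually filter on:
    # every extra index slows down the (frequent) writes.
    conn.execute(
        "CREATE INDEX idx_doc_rows_doc_rev ON doc_rows (document_id, revision)")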

If you are really keen on splitting into multiple tables, then I suggest that you look into how other people do data partitioning. Rather than creating tables dynamically, create a fixed number of tables from the start, based on how many you think you are likely to need, and allocate records to those tables not on something arbitrary like how many records happen to be in each table at the time, but on something predictable: the user's ZIP code is a commonly given example; in your case it could be the category the document is in, or the domain name or country of the user who created it. Pick something logical that lets you easily determine where a record ended up, and that will spread records out reasonably evenly.

One of the benefits of data partitioning this way, where you create all the partitions to start with, is that if you need to in future it's relatively easy to move to multiple database servers. If you are creating and destroying tables dynamically, that's going to make that less attainable.
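
A rough sketch of that fixed-partition idea (the partition count and the choice of key, a simple modulo of the document ID, are arbitrary assumptions for illustration; a category or country code would work the same way):

    import sqlite3

    NUM_PARTITIONS = 16  # decided once, up front; never changed at runtime

    def partition_table(document_id):
        # The partition is a pure function of the document ID, so it is
        # always predictable which table a document's rows live in.
        return f"doc_rows_p{document_id % NUM_PARTITIONS}"

    conn = sqlite3.connect(":memory:")
    for p in range(NUM_PARTITIONS):
        conn.execute(f"""
            CREATE TABLE doc_rows_p{p} (
                row_id      INTEGER PRIMARY KEY,
                document_id INTEGER NOT NULL,
                chapter     TEXT,
                req_text    TEXT
            )""")

    # Reads and writes resolve the target table the same, predictable way.
    conn.execute(
        f"INSERT INTO {partition_table(42)} (document_id, req_text) VALUES (?, ?)",
        (42, "The system shall ..."))
    rows = conn.execute(
        f"SELECT req_text FROM {partition_table(42)} WHERE document_id = ?",
        (42,)).fetchall()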

thomasrutter