views:

69

answers:

1

We develop and operate a blogging application in which user data a scattered across many tables:
- Blog
- Article
- Comment
- Message
- Trackback
- 50 other tables.

Users are able to close their account, and their account/contents must disappear from the site right away.

For legal/contractual reasons, we also must be able to undelete their account/content for a given duration, and also to make those data available for juridic authorities for another duration.

Over the years and different applications, we used different approaches:

  • "deleted" flag everywhere : Each table has a "deleted" column, which is updated when data is deleted/restored. Very nasty because it slows down every list generation queries, creates a lot of updates upon deletion/restore. Also, it does not handle the two stage deletion described above. In fact we never used this one, but it's worth dis-advising it :)

  • "Multi table": For each table, we create a second table with the same schema plus two extra fields (dateDeleted, reason). The extra fields are used to know if the data is still accessible for restoration, when to delete it, and why/how it was deleted in the first place. This version is just a bit better than the previous version, but can be very nasty performance wise too when tables are growing. Also, you have to change the schema of some tables (ie: remove UNIQUE constraints) which makes the system harder to understand/upgrade for new developers, administrators ... and mentally healthy people in general.

  • "Multi DB": Same approach as before, but we move data on a different database cluster, which allows to browse those data without impacting the "end users" db. Also, for this app, the uniqueness constraint is done at the java level, so all the schemas are the same. Lastly, the double data retention constraint is done by having a dedicated DB for each constraint, which makes things easiers.

I have to admit that none of those approaches satisfies me, even if they can work up to a certain amount of data. I have also imagined that we could just delete some key rows in the DB, and let the rest inconsistent (and scheduled for a more controlled deletion job), but it scares me ...

Do you know other ways of doing the same thing, keeping the same level of features (we could align the two durations to simplify the problem) ? I'm not looking a solution for my existing apps, but would like to improve the next ones.

Any input will be highly appreciated !

+3  A: 

It seams that every asset (blog, comment, ...) relies on the user. I would give the user table a column "active" which is 0 or 1, Then you implement a feature to ask on each query for the different asset "user active"? Try to optimize this lookup with indizes or something like that. In my opinion its the cleanst way. After this you can implement a job, which runs a cascading delete on users disabled for longer then x days.

Mork0075
+1 - For following the: Do not delete. Idea behind DBs.
windfinder
I guess that some asset would rely also on the blog object, but I get the idea. Thanks a lot !
MatthieuP