I have a database where most tables have a delete flag. The system soft-deletes items, so they are no longer accessible (except to admins, for example).

What worries me is that in a few years, when the tables are much larger, the overall speed of the system will be reduced.

What can I do to counteract effects like that?

  • Do I index the delete field?
  • Do I move the deleted data to an identical "deleted" table and move it back when undeleted?
  • Do I spread out the data over a few MySQL servers over time? (based on growth)

I'd appreciate any and all suggestions or stories.

UPDATE:

So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without?

So over time the deleted partition will grow large, and the occasional fetches from it will be slow (and get slower over time).

Would the speed difference be something I should worry about? I fetch most (if not all) data by some key value (some queries are searches, but those are allowed to be slow in this setup).

+3  A: 

I'd partition the table on the DELETE flag.

The deleted rows will be physically kept in another place, but from SQL's point of view the table remains the same.
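
For illustration, a minimal sketch of what LIST partitioning on the flag could look like (the items table and its columns are hypothetical; note that MySQL requires the partitioning column to be part of every unique key, which is why it is folded into the primary key here):

    -- Hypothetical schema; requires MySQL 5.1+.
    -- The partitioning column must appear in every unique key,
    -- including the primary key, hence PRIMARY KEY (id, deleted).
    CREATE TABLE items (
        id      INT UNSIGNED NOT NULL,
        name    VARCHAR(100),
        deleted TINYINT NOT NULL DEFAULT 0,
        PRIMARY KEY (id, deleted)
    )
    PARTITION BY LIST (deleted) (
        PARTITION p_live    VALUES IN (0),
        PARTITION p_deleted VALUES IN (1)
    );

    -- A query that filters on the flag is pruned to one partition,
    -- so queries for live rows never scan the deleted rows:
    SELECT * FROM items WHERE deleted = 0 AND id = 42;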

Quassnoi
That's a pretty interesting point. I didn't know this :)
Ólafur Waage
@Olafur: http://dev.mysql.com/doc/refman/5.1/en/partitioning-list.html
Quassnoi
Thanks.
Ólafur Waage
Needs MySQL 5.1, note.
chaos
@chaos Hmm, that will pose a problem.
Ólafur Waage
+4  A: 

Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
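
As a sketch (the table and column names here are hypothetical):

    -- A single-column index on the flag:
    CREATE INDEX idx_items_deleted ON items (deleted);

    -- A compound index for a common access pattern,
    -- e.g. "live rows under a given parent" (parent_id is hypothetical):
    CREATE INDEX idx_items_parent_deleted ON items (parent_id, deleted);

Since a two-valued flag isn't very selective on its own, the compound index is often the one that does the real work.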

chaos
Yes, there are actually very few queries that don't check the delete field for some value.
Ólafur Waage
Right; I didn't mean literally "all the time", more like "very commonly". If most of your queries check the delete field, that certainly seems to call for indexing it.
chaos
+1  A: 

Arguably, this decision could be deferred until performance problems actually appear, if they ever do. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.

If it becomes a problem, or you are certain that it will, start by partitioning on the deleted flag between two tables: one that holds current data and one that holds historical/deleted data. If, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users hitting that table (here limited to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id, or whatever) field you access the deleted records by. (I generally index every field by which I access a table, but at a certain scale there can be trade-offs regarding which indexes are most worthwhile.)
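
A minimal sketch of that two-table approach, assuming InnoDB so the move is transactional (the items, items_deleted, and user_id names are placeholders):

    -- Move a row to the historical table on soft delete;
    -- the transaction keeps the move atomic (InnoDB assumed).
    START TRANSACTION;
    INSERT INTO items_deleted SELECT * FROM items WHERE id = 42;
    DELETE FROM items WHERE id = 42;
    COMMIT;

    -- One-time DDL: index the field admins look up deleted records by.
    CREATE INDEX idx_items_deleted_user ON items_deleted (user_id);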

Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old ones. Some DBs include tuning options for making recent records easier to find than older ones, but you'll have to look that up for your particular database. Failing that, you can do it manually. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks, or months, depending on your constraints and suspected usage patterns. Newer data then lives in a much smaller table. Even if the admin is going to "browse" all the records rather than search for a specific one, you can start by showing the most recent n days and provide a link to see all days should they not find what they are looking for (e.g., most online banking applications let you browse transactions but show only the last 30 days of history unless you request otherwise).
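
A sketch of that archival step (the deleted_at column and the 90-day cutoff are illustrative assumptions; ancient_history is the table suggested above):

    -- Run periodically (e.g. from cron): move records older
    -- than 90 days out of the smaller, recent table.
    START TRANSACTION;
    INSERT INTO ancient_history
        SELECT * FROM items_deleted
        WHERE deleted_at < NOW() - INTERVAL 90 DAY;
    DELETE FROM items_deleted
        WHERE deleted_at < NOW() - INTERVAL 90 DAY;
    COMMIT;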

Hopefully you can avoid having to go a step further and shard on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (e.g., keeping your forum_posts on a separate machine from your sales_records), as it is FAR easier to set up and maintain. If you end up needing to shard on user_id, I suggest using google ;-]

Good luck. BTW, I'm not a DBA, so take this with a grain of salt.

protoscript
Very nice answer, thank you :)
Ólafur Waage