views:

117

answers:

8

From Objective-C i'm used to Reference Counting / Retain Counting (same thing). I love this concept, and it's plain simple.

So I thought, why not apply this to a database? The problem is:

Imagine these tables: users, photos. A user can own a photo, a user can like a photo of himself or someone else, and a user can recommend a photo. In any case there's a reference created to the photo.

Now the big problem is: Lets say I want to get rid of any photo that belongs to nobody anymore. So a photo that is not owned by any user, is not liked by any user, and is not recommended by any user, has to be deleted. I have no idea how to do that with the plain database functionality ;) well, maybe some hundred joins? I don't know. Database noob.

So the idea I'd like to verify from DB experts: Imagine I add a reference_count field to every table which can be referenced (requires some thought upon setup, of course). As soon as I define a relationship from any table_a to any table_b, and table_a is supposed to be the parent or master (strong reference), I increase the reference_count of the linked row by 1.

Let's say 150 users like a photo and one owns it. So it's reference_count is 151. Now the owner gives it up but 150 others still scream "oh nooo, no no no!! it's ours! we like it so muuucch!!". The system looks once every midnight through all these tables and attempts to delete every row which has a reference count of 0. So after some days, the users get bored of it and remove their "I like it" flag. Every time this happens, the reference_count gets reduced by 1. At the end, it's 0, and the next midnight when everyone is asleep, a cron job deletes it.

Of course, it must be configurable, because it must not always be the case that the photo has to get deleted after nobody references that anymore. I would solve this by setting it's reference_count initially to 1, not 0.

This means of course a lot of additional db calls, I guess. So what do you think? Or are there better solutions for that?

+4  A: 

You can do just the following:

DELETE
FROM    photos
WHERE   id NOT IN
        (
        SELECT  photo_id
        FROM    photos_users_like
        )
        AND id NOT IN (
        SELECT  photo_id
        FROM    photos_users_made
        )
        AND id NOT IN (
        SELECT  photo_id
        FROM    photos_users_recommended
        )

If you index your photo_ids in all tables, the NOT INs will be optimized by MySQL so that the predicates will return FALSE when the engine finds but a single matching record in the corresponding tables and there will be no need in reference counts.

Quassnoi
Looks promising! Is that standard SQL, or specific to a particular database?
openfrog
Yes, this is standard `SQL`, optimized fairly well by all major engines. See here: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/ for `MySQL` and browse the previous articles for other engines.
Quassnoi
+3  A: 

This problem is often solved using either

  1. An ORM layer that supports orphan-deletion, or
  2. A database trigger that deletes orphans

Using a reference count could be expensive, depending how often the data is modified, and will likely make your database code obscure and hard to follow.

skaffman
A: 

I think this is what cascading deletes are for. I would not be in favor of doing this. Reference counts are a fine idea for non-garbage collected languages, but in this case I think that SQL has had it covered for a long time.

This is why it's good to have a database administrator on hand. Noobs tend to get themselves into trouble.

duffymo
A: 

To find photos which no one owns, you would do something like:

select P.PhotoId
  from Photos as P
  where not exist (select ownerId from PhotoOwners as PO where PO.PhotoId = P.PhotoId)

And this can be extended to an arbitrary number of non-existence checks.

To prevent a referenced photos being deleted, you can make use of Foreign Keys.

Summary: RDBMSs are designed to solve these types of problems without reference counting and other (error prone) mechanisms being built by each application.

Richard
A: 

Bored users are unlikely to take the trouble to reverse their previous upvote. Why not count a more directly useful metric, like pageviews in the last month?

Richard Inglis
+1  A: 

You don't need to keep reference counts. I advise agaist it. Going by your description, you ucould use the following db structure:

create table users ( 
    id    int             not null auto_increment
,   name  varchar(64)     not null 
   ...more columns...
,   primary key (id)
)

create table photos ( 
    id             int          not null auto_increment
,   url            varchar(255) not null 
,   user_id_owner  int
,   primary key (id)
,   foreign key (user_id_owner) references users(id)
)

create table user_likes_photo (
    user_id  int not null
,   photo_id int not null
,   primary key(user_id, photo_id)
,   foreign key (user_id)  references users(id)
,   foreign key (photo_id) references photos(id)
)

create table user_recommends_photo (
    user_id_recommending  int not null
,   photo_id              int not null
,   user_id_recommended   int not null
,   primary key(user_id_recommending, photo_id, user_id_recommended)
,   foreign key (user_id_recommending)  references users(id)
,   foreign key (user_id_recommended)   references users(id)
,   foreign key (photo_id) references photos(id)
)

This way, you keep track of all relationships.

To remove unreferenced photo's you'd do:

delete from photos
where user_id_owner is null 
and id not in (
    select photo_id
    from   user_likes_photo
)
and id not in (
    select photo_id
    from   user_recommends_photo
)
Roland Bouman
+1. Foreign keys do what the OP is looking for. They can be configured to either force deletion to fail if it would violate foreign key constraints, or the deletion can be caused to cascade to dependent tables as well (e.g., you could have it automatically delete your likes and recommends rows)
Frank Farmer
+1  A: 

reference-counting should not be necessary; most databases will track relational-integrity for you

for example, in ms-sql if you executed

delete from photos where photoid = 1234

but that photo was referenced somewhere, the database will raise an error and refuse to delete the photo; in .NET code this manifests as a Sql exception

if you need to know in advance if a photo can be deleted, the EXISTS queries above will work; if you don't need to know in advance, then just try to delete it and let the database tell you that you cannot

reference-counting is a lot of extra work, and probably unnecessary; take advantage of the capabilities of the database in this regard

Steven A. Lowe
A: 

"take advantage of the capabilities of the database in this regard"

Amen to that.

And also amen to its logical conslusion that "if another DBMS has MORE capabilities in this regard, then switch to that DBMS immediately".

Erwin Smout