tags:

views:

65

answers:

3

I've got a MySQL database with typical schema for tagging items:

item (1->N) item_tag (N->1) tag 

Each tag has a name and a count of how many items have that tag ie:

item 
(
 item_id (UNIQUE KEY) 
)

item_tag 
(
 item_id (NON-UNIQUE INDEXED), 
 tag_id (NON-UNIQUE INDEXED)
)

tag
(
 tag_id (UNIQUE KEY)
 name
 count
)

I need to write a maintenance routine to batch re-tag one or more existing tags to a single new or existing other tag. I need to make sure that after the retag, no items have duplicate tags and I need to update the counts on each tag record to reflect the number of actual items using that tag.

Looking for suggestions on how to implement this efficiently...

A: 

An index/constraint on the item_tag table can prevent duplicate tags; or create the table with a composite primary key using both item_id and tag_id.

As to the counts, drop the count column from the tag table and create a VIEW to get the results:

CREATE VIEW tag_counts AS SELECT tag_id, name, COUNT(*) AS count GROUP BY tag_id, name

Then your count is always up to date.

Tony
I'd considered this for maintaining the counts but how would this perform on a large set of items? Given that batch retagging is rare, and retrieving the counts frequent, a denormalized count column is I think justified.
cantabilesoftware
If you are going to have millions of tags with infrequent update and frequent read, then yes, calculating the count each time is not the best solution. I just thought I'd mention it.An alternative might be a materialized view, but I'm not sure that will help you since you still have to refresh them. (I also don't know if MySQL supports them, I mainly use SQL Server) :)
Tony
+1  A: 

if i understood you correctly then you could try something like this:

/* new tag/item table clustered PK optimised for group by tag_id 
  or tag_id = ? queries !! */

drop table if exists tag_item;
create table tag_item
(
tag_id smallint unsigned not null,
item_id int unsigned not null,
primary key (tag_id, item_id), -- clustered PK innodb only
key (item_id)
)
engine=innodb;

-- populate new table with distinct tag/items

insert ignore into tag_item 
 select tag_id, item_id from item_tag order by tag_id, item_id;

-- update counters

update tag inner join
(
select 
 tag_id,
 count(*) as counter
from
 tag_item
group by
 tag_id
) c on tag.tag_id = c.tag_id
set
 tag.counter = c.counter;
f00
Interesting - would this be faster than a simply updating the tag with a select count(*)? (assuming appropriate indicies present).
cantabilesoftware
my understanding was...due to lack of entity integrity, namely a PK on item_tag table, that he had duplicate data (1,2),(1,3),(1,2) <-- oops and also required a re-mapping of some tags (which i omitted from my example as i'm lazy) so a re-count simply won't cut it. If you're worried about my choice of clustered PK for tag_item you might want to look at my response here http://stackoverflow.com/questions/3534597/rewriting-mysql-select-to-reduce-time-and-writing-tmp-to-disk/3535735#3535735
f00
A: 

This is what I've got so far, which seems to work but I don't have enough data yet to know how well it performs. Comments welcome.

Some notes:

  • Had to add a unique id field to to the item_tags table get the duplicate tag cleanup working.
  • Added support for tag aliases so that there's a record of retagged tags.
  • I didn't mention this before but each item also has a published flag and only published items should affect the count field on tags.
  • The code uses C#, subsonic+linq + "coding horror", but is fairly self explanatory.

The code:

public static void Retag(string new_tag, List<string> old_tags)
{
    // Check new tag name is valid
    if (!Utils.IsValidTag(new_tag))
    {
        throw new RuleException("NewTag", string.Format("Invalid tag name - {0}", new_tag));
    }

    // Start a transaction
    using (var scope = new SimpleTransactionScope(megDB.GetInstance().Provider))
    {
        // Get the new tag
        var newTag = tag.SingleOrDefault(x => x.name == new_tag);

        // If the new tag is an alias, remap to the alias instead
        if (newTag != null && newTag.alias != null)
        {
            newTag = tag.SingleOrDefault(x => x.tag_id == newTag.alias.Value);
        }

        // Get the old tags
        var oldTags = new List<tag>();
        foreach (var old_tag in old_tags)
        {
            // Ignore same tag
            if (string.Compare(old_tag, new_tag, true)==0)
                continue;
            var oldTag = tag.SingleOrDefault(x => x.name == old_tag);
            if (oldTag != null)
                oldTags.Add(oldTag);
        }

        // Redundant?
        if (oldTags.Count == 0)
            return;

        // Simple rename?
        if (oldTags.Count == 1 && newTag == null)
        {
            oldTags[0].name = new_tag;
            oldTags[0].Save();
            scope.Complete();
            return;
        }

        // Create new tag?
        if (newTag == null)
        {
            newTag = new tag();
            newTag.name = new_tag;
            newTag.Save();
        }

        // Build a comma separated list of old tag id's for use in sql 'IN' clause
        var sql_old_tags = string.Join(",", (from t in oldTags select t.tag_id.ToString()).ToArray());

        // Step 1 - Retag, allowing duplicates for now
        var sql = @"
            UPDATE item_tags
                SET tag_id=@newtagid
                WHERE tag_id IN (" + sql_old_tags + @");
            ";

        // Step 2 - Delete the duplicates
        sql += @"
            DELETE t1 
                FROM item_tags t1, item_tags t2
                WHERE t1.tag_id=t2.tag_id 
                    AND t1.item_id=t2.item_id
                    AND t1.item_tag_id > t2.item_tag_id;
            ";

        // Step 3 - Update the use count of the destination tag
        sql += @"
            UPDATE tags 
                SET tags.count=
                    (
                        SELECT COUNT(items.item_id)
                        FROM items
                        INNER JOIN item_tags ON item_tags.item_id = items.item_id
                        WHERE items.published=1 AND item_tags.tag_id=@newtagid
                    )
                WHERE
                    tag_id=@newtagid;
            ";

        // Step 4 - Zero the use counts of the old tags and alias the old tag to the new tag
        sql += @"
            UPDATE tags
                SET tags.count=0,
                    alias=@newtagid
                WHERE tag_id IN (" + sql_old_tags + @");
            ";

        // Do it!
        megDB.CodingHorror(sql, newTag.tag_id, newTag.tag_id, newTag.tag_id, newTag.tag_id).Execute();
        scope.Complete();
    }
cantabilesoftware
well your first and primary mistake is that you're not doing it server side - i didnt analyse your code further than that so can't comment anymore.
f00
Not sure I understand. There's a few bits up front that are done in code - but they're minor (get the id's of a couple of tags). The bulk of the work is just 4 sql statements - they're not in loop or anything.
cantabilesoftware