views:

4788

answers:

12

How would you design a database to support the following tagging features:

  • items can have a large number of tags
  • searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
  • creating/writing items may be slower to enable quick lookup/reading

Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.

Any ideas?

+7  A: 

I don't see a problem with a straightforward solution: Table for items, table for tags, crosstable for "tagging"

Indices on cross table should be enough optimisation. Selecting appropriate items would be

SELECT * FROM items WHERE id IN  
    (SELECT DISTINCT item_id FROM item_tag WHERE  
    tag_id = tag1 OR tag_id = tag2 OR ...)

AND tagging would be

SELECT * FROM items WHERE  
    EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)  
    AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)  
    AND ...

which is admittedly, not so efficient for large number of comparing tags. If you are to maintain tag count in memory, you could make query to start with tags that are not often, so AND sequence would be evaluated quicker. Depending on expected number of tags to be matched against and expectancy of matching any single of them this could be OK solution, if you are to match 20 tags, and expect that some random item will match 15 of them, then this would still be heavy on a database.

Slartibartfast
+1  A: 

The easiest method is to create a "tags" table.
Target_Type -- in case you are tagging multiple tables
Target -- The key to the record being tagged
Tag -- The text of a tag

Querying the data would be something like
Select distinct target from tags
where tag in ([your list of tags to search for here])
and target_type = [the table you're searching]

-- Edit --
Based on your requirement to AND the conditions, the query above would turn into something like this

Select target from (select target, count(*) cnt from tags
where tag in ([your list of tags to search for here])
and target_type = [the table you're searching]) where cnt = [number of tags being searched]

Brad Bruce
A: 

You won't be able to avoid joins and still be somewhat normalized.

My approach is to have a Tag Table.

 TagId (PK)| TagName (Indexed)

Then, you have a TagXREFID column in your items table.

This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:

 TagXrefID | ItemID | TagId

So, to get all tags for an item would be something like:

SELECT Tags.TagId,Tags.TagName 
     FROM Tags,TagXref 
     WHERE TagXref.TagId = Tags.TagId 
         AND TagXref.ItemID = @ItemID

And to get all items for a tag, I'd use something like this:

SELECT * FROM Items, TagXref
     WHERE TagXref.TagId IN 
          ( SELECT Tags.TagId FROM Tags
                WHERE Tags.TagName = @TagName; )
     AND Items.ItemId = TagXref.ItemId;

To AND a bunch of tags together, You would to modify the above statement slightly to add AND Tags.TagName = @TagName1 AND Tags.TagName = @TagName2 etc...and dynamically build the query.

FlySwat
+4  A: 

You might want to experiment with a not-strictly-database solution like a Java Content Repository implementation (e.g. Apache Jackrabbit) and use a search engine built on top of that like Apache Lucene.

This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.

However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.

EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.

Zizzencs
A: 

Thanks for all the answers so far.

If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)

Christian Berg
+3  A: 

This is (almost) the same question as How do you recommend implementing tags or tagging

David Schmitt
A: 

What I like to do is have a number of tables that represent the raw data, so in this case you'd have

Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)

This works fast for the write times, and keeps everything normalized, but you may also note that for each tag, you'll need to join tables twice for every further tag you want to AND, so it's got slow read.

A solution to improve read is to create a caching table on command by setting up a stored procedure that essentially creates new table that represents the data in a flattened format...

CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)

Then you can consider how often the Tagged Item table needs to be kept up to date, if it's on every insert, then call the stored procedure in a cursor insert event. If it's an hourly task, then set up an hourly job to run it.

Now to get really clever in data retrieval, you'll want to create a stored procedure to get data from the tags. Rather than using nested queries in a massive case statement, you want to pass in a single parameter containing a list of tags you want to select from the database, and return a record set of Items. This would be best in binary format, using bitwise operators.

In binary format, it is easy to explain. Let's say there are four tags to be assigned to an item, in binary we could represent that

0000

If all four tags are assigned to an object, the object would look like this...

1111

If just the first two...

1100

Then it's just a case of finding the binary values with the 1s and zeros in the column you want. Using SQL Server's Bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.

Check this link to find out more.

digiguru
A: 

To paraphrase what others have said: the trick isn't in the schema, it's in the query.

The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.

The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBS and changing the title to something like "Optimal way to perform AND query on a tagging database".

I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.

Portman
You probably shouldn't refrain from giving tidbits about a certain technology because other people trying to work in this problem domain may actually be using that technology and would benefit.
Redbeard 0x0A
A: 

I'd second @Zizzencs suggestion that you might want something that's not totally (R)DB-centric

Somehow, I believe that using plain nvarchar fields to store that tags with some proper caching/indexing might yield faster results. But that's just me.

I've implemented tagging systems using 3 tables to represent a Many-to-Many relationship before (Item Tags ItemTags), but I suppose you will be dealing with tags in a lot of places, I can tell you that with 3 tables having to be manipulated/queried simultaneously all the time will definitely make your code more complex.

You might want to consider if the added complexity is worth it.

chakrit
+9  A: 

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in concise and yet comprehendible way.

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

Troels Arvin
I have to say that answer is a bit short sighted, because using a bit field type of the database limits you to a specific number of bits. This does not mean each item is limited to a certain number of tags, but that there can only be a certain number of unique tags in the whole system (usually up to 32 or 64).
Mark Renouf
Assuming a 3nf implementation (Question, Tag, Question_has_Tag), and a bitmap index on the Tag_id in Question_has_Tag, the bitmap index has to rebuilt every time a question has a tag added or removed. A query like `select * from question q inner join question_has_tag qt where tag_id in (select tag_id from tags where (what we want) minus select tag_id from tags where (what we don't)` should be fine and scale out assuming the right b-tree indexes exist on the middle table
Adam Musch
"This article" link is dead. I would have liked to read that :(
Mark
Mark: This one looks good: http://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/It's probably a re-published version of the one I referred to.
Troels Arvin
+21  A: 

Here's a good article on tagging Database schemas:

http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html

along with performance tests:

http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html

Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.

Jeff Atwood
Would you mind sharing how you went about it with SO?
objektivs
I would also love to have more detailed technical insight on how you implemented the tagging system with SO? I think on a podcast you said you keep all tags in a column with every question and then serialize/de-serialize them on the fly? I would love to know more about it and maybe see some code snippets.I've been looking around and having found any details, is there a link where you've already done this before I ask the question on META?
Marston A.
This question on Meta has some info on the SO schema: http://meta.stackoverflow.com/questions/1863/so-database-schema
Barrett
+5  A: 

I just wanted to highlight that the article that @Jeff Atwood links to (http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html) is very thorough (It discusses the merits of 3 different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). Also lots of good stuff in the comments.

ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.

Winston Fassett
Nice article and performance benchmarks as well.
Wil