views:

176

answers:

3

How to design data storage for huge tagging system (like digg or delicious)?

There is already discussion about it, but it is about centralized database. Since the data is supposed to grow, we'll need to partition the data into multiple shards soon or later. So, the question turns to be: How to design data storage for partitioned tagging system?

The tagging system basically has 3 tables:

Item (item_id, item_content)

Tag (tag_id, tag_title)

TagMapping(map_id, tag_id, item_id)

That works fine for finding all items for given tag and finding all tags for given item, if the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.

For table Item, we can partition its content with its key item_id. For table Tag, we can partition its content with its key tag_id. For example, we want to partition table Tag into K databases. We can simply choose number (tag_id % K) database to store given tag.

But, how to partition table TagMapping?

The TagMapping table represents the many-to-many relationship. I can only image to have duplication. That is, same content of TagMappping has two copies. One is partitioned with tag_id and the other is partitioned with item_id. In scenario to find tags for given item, we use partition with tag_id. If scenario to find items for given tag, we use partition with item_id.

As a result, there is data redundancy. And, the application level should keep the consistency of all tables. It looks hard.

Is there any better solution to solve this many-to-many partition problem?

+1  A: 

The rule is that you partition by field that you are going to query by. Otherwise you'll have to look through all partitions. Are you sure you'll need to query Tag table by tag_id only? I believe not, you'll also need to query by tag title. It's no so obvious for Item table, but probably you also would like to query by something like URL to find item_id for it when other user will assign tags for it.

But note, that Tag and Item tables has immutable title and URL. That means you can use the following technique:

  1. Choose partition from title (for Tag) or URL (for Item).
  2. Choose sequence for this partition to generate id.

You either use partition-localID pair as global identifier or use non-overlapping number sets. Anyway, now you can compute partition from both id and title/URL fields. Don't know number of partitions in advance or worrying it might change in future? Create more of them and join in groups, so that you can regroup them in future.

Sure, you can't do the same for TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id, by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. So the difference is that you use different partitioning (by different field) for each index. I see no reason to worry about.

Denis Otkidach
+4  A: 

I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!

Finding Tags for a Given Item

A1. You're going to display all of the tags for a given item at once

A2. You're going to ensure that all of an item's tags are unique

Finding Items for a Given Tag

B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)

B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags

B3. You're going to sort the items for a given tag (or tags) by some measure of popularity

Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).

Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records needed to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.

The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.

Michael Petito
I'm not sure that idea of combined item_id's is good. Popularity changes in time. It's also hard to guess popularity/page rank/whatever while creating record (that's the moment in time, when combined item_id should be calculated).
Wacek
Yes, I agree and would not normally recommend encoding values into an ID. Popularity does change over time but not too quickly. If moving items between partitions is not too difficult, it still may be an OK approach. You might use another ID for each item in `TagMapping` that can more easily change over time (instead of the item primary key which is likely used in many other places). A background process could incrementally re-calculate these new IDs and re-organize records in `TagMapping` to reflect changes in popularity.
Michael Petito
+1  A: 

Most likely your queries are going to be related to a user or a topic. Meaning that you should have all info related to those in one place.

You're talking about distribution of DB, usually this is mostly an issue of synchronization. Reading, which is about 90% of the work usually, can be done on a replicated database. The issue is how to update one DB and remain consistent will all others and without killing the performances. This depends on your scenario details.

The other possibility is to partition, like you asked, all the data without overlapping. You probably would partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just telling which dedicated DB is holding the data. You can then query the correct one. Since you partition by ID, all info related to that topic could be on that specialized database. You could partition also by language or country for an international website.

Last but not least, you'll probably end up mixing the two: Some non-overlapping data, and some overlapping (replicated) data. First find usual operations, then find how to make those on one DB in least possible queries.

PS: Don't forget about caching, it'll save you more than distributed-DB.

Wernight