Maybe the solution is obvious, but I cant seem to find a good one.
In my upcoming project, there will be one main table, its data will be read frequently. Update / Insert / Delete speed is not an issue.
The items in that main table are associated to 4 or more categories. An item can have 50 - 100 or more relations within one category.
The most common operations that will be performed on the database:
- select all items that have been assigned to category A, B, C, ... with LIMIT X, Y
- count all items that have been assignged to category A, B, C, ...
My first thought on how to create a database for the above was something like this (classic approach I guess):
First, for each of the four categories I create a category
table:
id - PK, int(11), index
name - varchar(100)
then I will have one item
table:
id - PK, int(11), index
... some more data fields, about 30 or so ...
and to relate the category
tables, there will be 4 or more lookup / MM tables like so:
id_item - int(11)
id_category - int(11)
The queries looked something like this:
select
item.*
from
item
inner mm_1 on mm_1.id_item = item.id
inner join cat_1 on cat_1.id = mm_1.id_category and cat_1.id in (1, 2, ... , 100)
inner mm_2 on mm_2.id_item = item.id
inner join cat_2 on cat_2.id = mm_2.id_category and cat_2.id in (50, 51, ... , 90)
Of course the above approach with MM tables would work, but as the app should provide very good SELECT
performance, I tested it with real world amounts of data (100.000 records in the item
table, 50 - 80 relations in each category), but it was not as fast as I expected, even with indexes in place. I also tried using WHERE EXISTS
instead of INNER JOIN
when selecting.
My second idea was to just use the item
table from above denormalize the data.
After reading this blog post about using bitmasks I gave it a try and assigned each category a bit value:
category 1.1 - 1
category 1.2 - 2
category 1.3 - 4
category 1.4 - 8
... etc ...
So, if an item
was tagged with category 1.1
and category 1.3
, it had a bitmask of 5
, which I then stored in a field item.bitmask
and I can query it like so:
select count(*) from item where item.bitmask & 5 = 5
But performance was not so great either.
The problems with this bitmasking approach: mysql does NOT use any indexes when bit operators are involved and even when item.bitmask
would be of type BIGINT
I can only handle up to 64 relations, but I need to support up to 100 per category.
That was about it. I cant think of anything more except maybe polluting the item
table with many, many fields like category_1_1
up to category_4_100
each of the contains either 1 or 0. But that could lead to many AND
in the WHERE
clause of the select and that does not seem like a good idea, too.
So, what are my options? Any better ideas out there?
EDIT: as an response to Cory Petosky comment "What does "An item can have 50 - 100 or more relations within one category." mean?":
To make it more concrete, the item
table represents an image. Images are among other criterias categorized in moods (mood would be one of 4 categories). So it would look like this:
Image:
- Category "mood":
- bright
- happy
- funny
- ... 50 or so more ...
- Category "XYZ":
- ... 70 or so more ...
If my image table would be a class in C#, it would look like this:
public class Image {
public List<Mood> Moods; // can contain 0 - 100 items
public List<Some> SomeCategory; // can contain 0 - 100 items
// ...
}