Before we go into premature optimization mode, it may be useful to look into the following query template. If nothing else this could be used as a baseline against which the effectiveness of possible optimizations can be measured.
SELECT T.Tagid, TagInfo.TagName, COUNT(*)
FROM Items I
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
JOIN ItemTagMap T ON I.ItemId = T.ItemId
--JOIN ItemTagMap T1 ON I.ItemId = T1.ItemId
WHERE I.ItemId IN
(
SELECT ItemId
FROM Items
WHERE -- Some typical initial search criteria
Title LIKE 'Bug Report%' -- Or some fulltext filter instead...
AND ItemDate > '02/22/2008'
AND Status = 'C'
)
--AND T1.TagId = 'MySql'
GROUP BY T.TagId, TagInfo.TagName
ORDER BY COUNT(*) DESC
The subquery is the "driving query", i.e. the one corresponding to the end-user's initial criteria. (see below for details on how this query, required multiple times may fit in an overall optimized flow)
Commented is the JOIN on T1 (and possibly T2, T3, when several tags are selected), and, with the WHERE clause, the associated criteria. These are needed when the user selects a particular tag, whether as part of the initial search or by refinement. (It may be more efficient to place these joins and where clauses within the sub-query; more on these below)
Discussion...
The "driving query", or a variation thereof is needed for two distinct purposes:
- #1 to provide the complete list of ItemId which is needed to enumerate all associated tags.
- #2 to provide the first N ItemId values (N being the display page size), for the purpose of looking up Item detail info in the Item table.
Note that the complete list doesn't need to be sorted (or it may benefit from sorting in a different order), whereby the second list needs to be sorted based on the user's choice (say by Date, descending or by Title, alphabetically ascending). Also note that if there is any sort order required, the cost of the query will imply dealing with the complete list (shy of odd optimization by SQL itself, and/or some denormalization, SQL needs to "see" the last records on that list, in case they belong to the top, sort-wise).
This latter fact, is in favor of having the very same query for both purposes, the corresponding list can be stored in a temporary table. The general flow would be to quickly lookup the top N Item records with their details and returns this to the application at once. The application can then obtain ajax-fashion the list of Tags for refinements. This list would be produce with a query akin the one above, where the subquery is replaced by a "select * from temporaryTable." The odds are good that the SQL optimizer will decide to sort this list (in some cases), let's let it do that, rather than second guessing it and sorting it explicitly.
One other point to consider is to maybe bring the join(s) on ItemTagMap table inside the "driving query" rather that as shown above. It is probably best to do so, both for performance, and because it will produce the right list for the #2 purpose (display of a page of items).
The query/flow described above will likely scale rather well, even on relatively modest hardware; tentatively into the 1/2 Million+ Items, with sustained user searches maybe up to 10 per second. One of the key factor would be the selectivity of the initial search criteria.
Optimization ideas
- [Depending on the typical search cases and on the data stats] it may make sense to denormalize by bringing (indeed duplicating) some of Items' fields to the ItemTagMap table. Short fields in particular may be 'welcome' there.
- As the data grows in the million+ Items, we could exploit the typically strong correlation of some tags (ex: in SO, PHP often comes with MySql, btw often for no good reason...), with various tricks. For example the introduction of "multi-Tag" TagIds could render the input logic a bit more complicated, but could also reduce the Map size significantly.
-- 'nough said! --
Appropriate architecture and optimizations should be selected in light of the actual requirements and of the effective data statistical profile...