tags:

views:

563

answers:

3

Hi everyone,

I'm coding a website in PHP/MySQL and I'd like to implement a similar to stackoverflow tagging engine. I have 3 relevant tables in DB: 1. Items 2. Tags 3. ItemTagMap (maps tags to items, n:n mapping)

Now, on search page I'd like to show distinct list of all tags for entire search result (not just the current page), so that users can "refine" their search by adding/removing tags from that tag list.

The question is that it's a pretty heavy query on the DB and there can be tons of search requests that result in different result sets and thus different tag sets.

Does anyone know how to implement this effectively?

A: 

Assuming:

  • Item (id);
  • Tag (id, name) with index on name;
  • ItemTag (item_id, tag_id).

then:

SELECT t.name
FROM Tag t
WHERE EXISTS (SELECT 1 FROM ItemTag WHERE item_id = 1234)
ORDER BY t.name

Nothing intensive about that. This is similar but my guess is it would be slower:

SELECT t.name
FROM Tag t
WHERE t.id IN (SELECT tag_id FROM ItemTag WHERE item_id = 1234)
ORDER BY t.name

This can be done as a join as well:

SELECT DISTINCT t.name
FROM Tag t
JOIN ItemTag i WHERE i.tag_id = t.id
WHERE i.item_id = 1234
ORDER BY t.name

I think the first one will be faster but as is always the case with SQL, it's worth testing (on a sufficiently sized data set).

The above have been done to list the tags for a single item. You want a composite set of tags for search results. That's not difficult from the above but it depends on how you get your search results.

cletus
I'm not sure this answers the OP at all. The underlying search (from the site users) will produce many item_id values. I doubt you suggest that one should run a search for each of these ids individually...
mjv
@mvj It's a simple modification to take it to that level. To compare tags to multiple items, do `...WHERE item_id IN (...)...`. Also, to narrow results down by tag, just add to the clause `...WHERE item_id IN (...) AND tag_id IN (...)...`
Justin Johnson
@cletus Ok on the item_id IN (...) part. The narrowing down based on Tad_Id, however, would require to join the ItemTag table multiple times. Right ?   (unrelated) how do you do the background color in comments? Pretty cool.
mjv
A: 

You'll want to try to minimize the number of DB calls, putting the heavy work into PHP.

First, select all your Items from the DB:

select * from items where (conditions);

Then, create an array of all id's from the result set.

$ids = array();
foreach ($items as $item) {
    $ids[] = $item['id'];
}
$ids = implode(',' $ids);

Then select all ItemTagMaps and associated tag data for the Item ID's you previously retrieved.

select map.item_id, t.id, t.name from tags t, item_tag_maps map where t.id = map.tag_id and map.item_id in ($ids);

Now when you loop through your $items array, you can locate all matching tags from the 2nd SQL query you performed as long as it has a matching item_id value.

Matt Huggins
Wouldn't be more efficient to do following? select * from items where (conditions); select t.name from tags t inner join item_tag_maps map on t.id = map.tag_id inner join items on map.item_id = item_id WHERE {same condition goes here...} ?
Michael D.
Also, your method retrieves from database whole table of items even if I am going to display one page of data only. Using my method I can add LIMIT () to the first select to bring minimum data
Michael D.
No Michael, you're mistaken. Note the conditionals that I pass to each SELECT statement. The second SELECT only retrieves the tags for matching item_id's that were retrieved in the first SELECT statement, and the first SELECT statement should match the conditions of the first page.
Matt Huggins
+2  A: 

Before we go into premature optimization mode, it may be useful to look into the following query template. If nothing else this could be used as a baseline against which the effectiveness of possible optimizations can be measured.

SELECT T.Tagid, TagInfo.TagName,  COUNT(*)
FROM Items I
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
JOIN ItemTagMap T  ON I.ItemId = T.ItemId 
--JOIN ItemTagMap T1 ON I.ItemId = T1.ItemId
WHERE I.ItemId IN
  (
      SELECT ItemId 
      FROM Items
      WHERE   -- Some typical initial search criteria
         Title LIKE 'Bug Report%'   -- Or some fulltext filter instead...
         AND  ItemDate > '02/22/2008'
         AND  Status = 'C'
  )
--AND T1.TagId = 'MySql'
GROUP BY T.TagId, TagInfo.TagName
ORDER BY COUNT(*) DESC

The subquery is the "driving query", i.e. the one corresponding to the end-user's initial criteria. (see below for details on how this query, required multiple times may fit in an overall optimized flow) Commented is the JOIN on T1 (and possibly T2, T3, when several tags are selected), and, with the WHERE clause, the associated criteria. These are needed when the user selects a particular tag, whether as part of the initial search or by refinement. (It may be more efficient to place these joins and where clauses within the sub-query; more on these below)

Discussion... The "driving query", or a variation thereof is needed for two distinct purposes:

  • #1 to provide the complete list of ItemId which is needed to enumerate all associated tags.
  • #2 to provide the first N ItemId values (N being the display page size), for the purpose of looking up Item detail info in the Item table.

Note that the complete list doesn't need to be sorted (or it may benefit from sorting in a different order), whereby the second list needs to be sorted based on the user's choice (say by Date, descending or by Title, alphabetically ascending). Also note that if there is any sort order required, the cost of the query will imply dealing with the complete list (shy of odd optimization by SQL itself, and/or some denormalization, SQL needs to "see" the last records on that list, in case they belong to the top, sort-wise).

This latter fact, is in favor of having the very same query for both purposes, the corresponding list can be stored in a temporary table. The general flow would be to quickly lookup the top N Item records with their details and returns this to the application at once. The application can then obtain ajax-fashion the list of Tags for refinements. This list would be produce with a query akin the one above, where the subquery is replaced by a "select * from temporaryTable." The odds are good that the SQL optimizer will decide to sort this list (in some cases), let's let it do that, rather than second guessing it and sorting it explicitly.

One other point to consider is to maybe bring the join(s) on ItemTagMap table inside the "driving query" rather that as shown above. It is probably best to do so, both for performance, and because it will produce the right list for the #2 purpose (display of a page of items).

The query/flow described above will likely scale rather well, even on relatively modest hardware; tentatively into the 1/2 Million+ Items, with sustained user searches maybe up to 10 per second. One of the key factor would be the selectivity of the initial search criteria.

Optimization ideas

  • [Depending on the typical search cases and on the data stats] it may make sense to denormalize by bringing (indeed duplicating) some of Items' fields to the ItemTagMap table. Short fields in particular may be 'welcome' there.
  • As the data grows in the million+ Items, we could exploit the typically strong correlation of some tags (ex: in SO, PHP often comes with MySql, btw often for no good reason...), with various tricks. For example the introduction of "multi-Tag" TagIds could render the input logic a bit more complicated, but could also reduce the Map size significantly.


-- 'nough said! --
Appropriate architecture and optimizations should be selected in light of the actual requirements and of the effective data statistical profile...

mjv
This is an amazing post I want you
Joshua
@Joshua: thanks for the kind words.
mjv