views:

273

answers:

7

I've got a bunch of tables that I'm joining using a unique item id. The majority of the where clause conditions will be built programatically from a user sumbitted form (search box) and multiple conditions will often be tested against the same table, in this case item tags.

My experience with SQL is minimal, but I understand the basics. I want to find the ids of active (status=1) items that have been tagged with a tag of a certain type, with the values "cats" and "kittens". Tags are stored as (id, product_id, tag_type_id, value), with id being the only column requiring a unique value. My first attempt was;

   select 
      distinct p2c.product_id 
   from '.TABLE_PRODUCT_TO_CATEGORY.' p2c
      inner join '.TABLE_PRODUCT.' p on p2c.product_id = p.id 
      inner join '.TABLE_PRODUCT_TAG.' pt on p.id = pt.product_id
      inner join '.TABLE_TAG_TYPE.' tt on pt.tag_type_id = tt.id
   where 
      tt.id = '.PRODUCT_TAG_TYPE_FREE_TAG.'
      and p.status = 1
      and lower(pt.value) = "cats"
      and lower(pt.value) = "kittens"

but that returned nothing. I realised that the final AND condition was the problem, so tried using a self-join instead;

   select 
      distinct p2c.product_id 
   from '.TABLE_PRODUCT_TO_CATEGORY.' p2c
      inner join '.TABLE_PRODUCT.' p on p2c.product_id = p.id 
      inner join '.TABLE_PRODUCT_TAG.' pt on p.id = pt.product_id
      inner join '.TABLE_PRODUCT_TAG.' pt2 on p.id = pt2.product_id
      inner join '.TABLE_TAG_TYPE.' tt on pt.tag_type_id = tt.id
   where 
      tt.id = '.PRODUCT_TAG_TYPE_FREE_TAG.'
      and p.status = 1
      and lower(pt.value) = "cats"
      and lower(pt2.value) = "kittens"

Now everything works as expected and the result set is correct. So what do I want to know? To re-iterate, the results I'm after are the ids of active (status = 1) items that have been tagged with a tag of a certain type, with the values "cats" AND "kittens"...

  1. Are self-joins the best way of achieving these results?
  2. This query has the potential to be huge (I've omitted a category condition, of which there may be ~300), so does this self-join approach scale well? If not, is there an alternative?
  3. Will the self-join approach be the best way forward (assuming there is an alternative) if I allow users to specify complex tag searches? ie "cats" and ("kittens" or "dogs") not "parrots".
+3  A: 
select 
  distinct p2c.product_id
from '.TABLE_PRODUCT_TO_CATEGORY.' p2c
  inner join '.TABLE_PRODUCT.' p on p2c.product_id = p.id
  inner join '.TABLE_PRODUCT_TAG.' pt on p.id = pt.product_id
  inner join '.TABLE_TAG_TYPE.' tt on pt.tag_type_id = tt.id   
where 
  tt.id = '.PRODUCT_TAG_TYPE_FREE_TAG.'
  and p.status = 1  
  and (lower(pt.value) = "cats" or lower(pt.value) = "kittens")
Greco
Works perfectly, but I'm after "cats" AND "kittens", not "cats" OR "kittens"
MatW
MatW, I think that you are heavily confusing what OR means. If you search for "cats OR kittens", then you will get all rows that have one OR the other, and this is logically the same as all rows that has "cats" PLUS all rows that have "kittens".
Svante
No, there is no confusion. I don't want items that have row matches for "cats" plus those that have row matches for "kittens", but items that have row matches for both, hence the need for an AND statement. Each item can have multiple tags assigned to it.
MatW
+4  A: 

wouldn't this work in your first query?

instead of

and lower(pt.value) = "cats"
and lower(pt.value) = "kittens"

do this

and lower(pt.value) in ("cats","kittens")
SQLMenace
Fantastic, I didn't even realise this syntax existed. Unless I'm mistaken though, this syntax is shorthand for;"and (lower(pt.value) = "cats" or lower(pt.value) = "kittens")"and I'm after "cats" AND "kittens", not "cats" OR "kittens"
MatW
can you have cats and kittens in the same row?
SQLMenace
they are not in the same row. It's "pt.value = "cats" and pt2.value = "kittens". This answer misquotes the question (or rather quotes the wrong part of the question).
le dorfier
No, each tag exists on a seperate row in the tag table.
MatW
?? I don't understand how this assertion answers the question.
le dorfier
Then please re-read. I answered a yes / no question with a no. The question itself seems off-topic, but I'm guessing SQLMenace knows more than me is going somewhere with his question! :)
MatW
pt.value and pt2.value is the same table just aliased because he did a self join...so these two things are in two different rows in the TABLE_PRODUCT_TAG table
SQLMenace
> I'm after "cats" AND "kittens", not "cats" OR "kittens"This makes no sense. The same row can't be two values at the same time. So you're either confused, or not explaining it well enough. Have you actually tried it to see what the OR query gives you? What's wrong with that result set?
Emtucifor
Is this what you are looking for for the one table...sql server syntax but you get the idea SELECT product_id FROM TABLE_PRODUCT_TAGWHERE tag IN ('cat','kittens')GROUP BY product_idHAVING COUNT(*) = 2
SQLMenace
A: 

The problem with the initial query was this:

  and lower(pt.value) = "cats"
  and lower(pt.value) = "kittens"

There exists no tag for which the value is both "cats" and "kittens", therefore no records will be returned. Using an IN clause as SQLMenace suggests would be the solution - that way you're saying, "give me back any active item that has been tagged 'cats' or 'kittens'".

But if you want any active item that has BOTH tags - then you need to do something like your second query. It's not perfectly clear from your question if that's what you're after.

For something like your Question #3:

"cats" and ("kittens" or "dogs") not "parrots".

you would want pt1, pt2, and (in a subquery) pt3, and something like this:

and lower(pt1.value) = "cats"
and lower(pt2.value) in ("kittens", "dogs")
and not exists (select * from '.TABLE_PRODUCT_TAG.' pt3 where pt3.product_id = p.id and lower(pt3.value) = "parrots")

The broadly general case could get quite messy...

Carl Manaster
You say "something like your second query", does this mean there is an alternative to self-joins that would acheive the same results? Also, I don't understand what you mean by "The broadly general case could get quite messy..."? FYI, I've edited my question to make things a little clearer.
MatW
I guess by "something like your second query" I mean a self-join, or a subselect. By "messy" - well, sometimes you need multiple joins to the tag table, sometimes you need an OR or an IN clause, sometimes you need a subselect (for NOT EXISTS); generalizing that is complex.
Carl Manaster
Gotcha, thanks. :)
MatW
A: 

Your answer is "yes, that's a scalable technique". As far as adding complexity, I think you'll overreach your users' ability to understand what they are doing before you have an efficient-query problem.

le dorfier
A: 

You're building yet another entity-attribute-value data model. Since you asked about scalability, here's a warning: EAV models usually don't scale and don't perform on top of RDBMS. Ultimately this 'flexible' data model ends up clobbering the optimizer and you'll be scanning millions and millions of rows to fetch your few dogs and kittens. Wikipedia has a topic covering this model and some of the downsides. Don't know what your target DB is, for instance SQL Server CAT published a white paper with common problems in the EAV model.

Remus Rusanu
Thanks for the resources (more learning, phew!). However, whilst the tag schema isn't as normalised as it could be, it isn't EAV either. I assume the condition constant called PRODUCT_TAG_TYPE_FREE_TAG was what confused things and inferred an EAV design, but users have no control over this value of this column. The name of the constant certainly is ambiguous though! (You can blame the previous code-monkey for that; I've inherited an awful lot of "oddities" with this project...)
MatW
A: 

Ok, let me re-state the question to make sure I understand:

You are trying to show all products that have two different specific tags ("cats" and "kittens") but the tags are stored in a 1-to-many table.

The double-join does work, but here's another alternative:

SELECT ...
FROM P
WHERE p.status = 1
  AND p.ProductID IN (SELECT Product_ID FROM tags WHERE value = "cats")
  AND p.ProductID IN (SELECT Product_ID FROM tags WHERE value = "kittens")

Just add additional AND statements depending on the options the user selects.

The SQL optimizer should actually treat this the same way it treats a join, so I don't think performance would scale any worse than your version. Worth testing with your dataset, though, to make sure.

BradC
A: 

AIR CODE

select 
  distinct p2c.product_id 
from '.TABLE_PRODUCT_TO_CATEGORY.' p2c
  inner join '.TABLE_PRODUCT.'     p  on p2c.product_id = p.id 
where 
  and p.status = 1
  and 2 = (
      SELECT  COUNT(1)
      FROM '.TABLE_PRODUCT_TAG.' pt
        INNER JOIN  '.TABLE_TAG_TYPE.' tt ON pt.tag_type_id = tt.id
      WHERE tt.id = '.PRODUCT_TAG_TYPE_FREE_TAG.'
      AND pt.product_id = p.id /* edit */
      lower(pt.value) IN( "cats", "kittens" )
)