I have a small video site where I want to show related videos, chosen by how many tags they have in common. What would be the best MSSQL 2005 query to get the related videos?

A LINQ query would be appreciated as well.


Schema:

CREATE TABLE Videos
    (VideoID bigint not null , 
    Title varchar(100) NULL, 
    isActive bit NULL  )

CREATE TABLE Tags
    (TagID bigint not null , 
    Tag varchar(100) NULL )

CREATE TABLE VideoTags
    (VideoID bigint not null , 
    TagID bigint not null )

Each video can have multiple tags. I want to get related videos based on tags, ranked by how many tags match: the videos with the most matching tags should come first and those with fewer matches last. Videos with no matching tags should not be returned at all.

I would also like to know whether the schema above is OK if I have, say, more than a million videos with 10-20 tags each.

A: 

Take a look at my post here: http://stackoverflow.com/questions/1051927/sql-to-find-most-popular-category/1051963#1051963 which covers similar concepts.

Andrew Siemer
A: 

Is something like this what you're after?

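string horror = "Horror";
string thriller = "Thriller";

// Videos carrying either tag. A video tagged with both appears twice,
// and the result is not ranked by number of matches.
var results =
    from v in db.Videos
    join vt in db.VideoTags on v.VideoID equals vt.VideoID
    join t in db.Tags on vt.TagID equals t.TagID
    where
        t.Tag == horror || t.Tag == thriller
    select v;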
Chalkey
A: 

This query returns the videos ordered by the number of matching tags (in descending order):

select v.VideoID, v.Title, count(*) as nroOfTags
from Videos v
join VideoTags vt on vt.VideoID = v.VideoID
join Tags t on t.TagID = vt.TagID
where t.Tag in ('horror','action','adventure')
group by v.VideoID, v.Title
order by count(*) desc

Regarding the data model, it's OK. It will work well assuming all the proper indexes are in place.
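
As a rough sketch of what "proper indexes" could look like for a tag lookup: one index to resolve tag names to TagIDs, and one on VideoTags leading with TagID so matching videos can be found without scanning the whole table. The index names are hypothetical, and SQLite (via Python's sqlite3 module) stands in for SQL Server 2005 purely so the example can be run anywhere; the right indexes for a real workload would need to be checked against actual query plans.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server 2005 so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Videos    (VideoID INTEGER PRIMARY KEY, Title TEXT, isActive INTEGER);
CREATE TABLE Tags      (TagID   INTEGER PRIMARY KEY, Tag TEXT);
CREATE TABLE VideoTags (VideoID INTEGER NOT NULL, TagID INTEGER NOT NULL);

-- Hypothetical index names, not part of the original schema.
CREATE INDEX IX_Tags_Tag        ON Tags (Tag);                 -- tag name -> TagID
CREATE INDEX IX_VideoTags_TagID ON VideoTags (TagID, VideoID); -- TagID -> videos

INSERT INTO Videos VALUES (1, 'Alien', 1), (2, 'Jaws', 1), (3, 'Up', 1);
INSERT INTO Tags VALUES (1, 'horror'), (2, 'action'), (3, 'adventure');
INSERT INTO VideoTags VALUES (1, 1), (1, 2), (2, 1), (3, 3);
""")


def videos_by_tags(conn, tags):
    """Return (VideoID, Title, match count) rows, best match first."""
    if not tags:
        return []
    placeholders = ", ".join("?" for _ in tags)
    sql = f"""
        SELECT v.VideoID, v.Title, COUNT(*) AS nroOfTags
        FROM Videos v
        JOIN VideoTags vt ON vt.VideoID = v.VideoID
        JOIN Tags t ON t.TagID = vt.TagID
        WHERE t.Tag IN ({placeholders})
        GROUP BY v.VideoID, v.Title
        ORDER BY COUNT(*) DESC, v.VideoID
    """
    return conn.execute(sql, list(tags)).fetchall()


for row in videos_by_tags(conn, ["horror", "action"]):
    print(row)
```

The extra ORDER BY column (VideoID) is only there to make ties come back in a stable order.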

tekBlues
A: 

I'd make a few changes in the DDL:

CREATE TABLE [Tags](
 [TagID] [bigint] IDENTITY(1,1) NOT NULL,
 [Tag] [nvarchar](100) NOT NULL,
PRIMARY KEY CLUSTERED 
(
 [TagID] ASC
),
 CONSTRAINT [UC_Tags] UNIQUE NONCLUSTERED 
(
 [Tag] ASC
)
)

GO

CREATE TABLE [Videos](
 [VideoID] [bigint] IDENTITY(1,1) NOT NULL,
 [Title] [nvarchar](100) NOT NULL,
 [isActive] [bit] NOT NULL,
PRIMARY KEY CLUSTERED 
(
 [VideoID] ASC
),
 CONSTRAINT [UC_Videos] UNIQUE NONCLUSTERED 
(
 [Title] ASC
)
)

GO

CREATE TABLE [VideoTags](
 [VideoID] [bigint] NOT NULL,
 [TagID] [bigint] NOT NULL,
PRIMARY KEY CLUSTERED 
(
 [VideoID] ASC,
 [TagID] ASC
)
)

GO

ALTER TABLE [VideoTags]  WITH CHECK ADD FOREIGN KEY([TagID])
REFERENCES [Tags] ([TagID])
GO

ALTER TABLE [VideoTags]  WITH CHECK ADD FOREIGN KEY([VideoID])
REFERENCES [Videos] ([VideoID])
GO
  1. I'd make the text columns nvarchar. Makes it easier to track "foreign" movies.
  2. I'd make the id columns IDENTITY columns and make them the primary keys
  3. I'd designate the foreign keys
  4. I'd make the Tag and Title columns unique. You don't want duplicate titles or tags
  5. I'd make all of these columns non-nullable. It makes no sense to have a video or tag with an unknown name, and a video is either active or inactive, never "maybe" or "unknown".
  6. I added a primary key to VideoTags to prevent duplication.
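
To illustrate points 4 and 6, here is a small sketch of how the unique constraint and the composite primary key reject duplicates. It uses SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption made so the example is runnable as-is), and the helper name try_insert is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; only the constraints matter here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tags (TagID INTEGER PRIMARY KEY, Tag TEXT NOT NULL UNIQUE);
CREATE TABLE VideoTags (
    VideoID INTEGER NOT NULL,
    TagID   INTEGER NOT NULL,
    PRIMARY KEY (VideoID, TagID)
);
""")


def try_insert(conn, sql, params):
    """Run one INSERT; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# The same tag name twice: the second insert violates the UNIQUE constraint.
tag_first = try_insert(conn, "INSERT INTO Tags (Tag) VALUES (?)", ("horror",))
tag_again = try_insert(conn, "INSERT INTO Tags (Tag) VALUES (?)", ("horror",))

# The same (VideoID, TagID) pair twice: the second violates the primary key.
link_first = try_insert(conn, "INSERT INTO VideoTags VALUES (?, ?)", (1, 1))
link_again = try_insert(conn, "INSERT INTO VideoTags VALUES (?, ?)", (1, 1))

print(tag_first, tag_again, link_first, link_again)
```

Without the composite key, a video tagged twice with the same tag would be counted twice in any query that ranks by COUNT(*).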

For a SQL query, I'd try the following. I can't be sure it's what you want without test data:

;WITH VIDEO_TAG_COUNTS (VideoID, TagCount)
AS
(
    SELECT V.VideoID, COUNT(*)
    FROM Videos V
    INNER JOIN VideoTags VT ON V.VideoID = VT.VideoID
    GROUP BY V.VideoID
)
SELECT V.VideoID, V.Title
FROM Videos V
INNER JOIN VIDEO_TAG_COUNTS VTC ON V.VideoID = VTC.VideoID
WHERE V.isActive = 1
ORDER BY VTC.TagCount DESC
John Saunders
what is VIDEO_TAG_COUNTS?
Marc V
That's a Common Table Expression (CTE). Standard SQL Construct available in SQL Server 2005 and above. Not strictly necessary in this case, but I like using them to build up a query. Besides, the more I use them, the less likely I am to forget the syntax.
John Saunders
A: 

Here's the SQL:

SELECT v.VideoID, v.Title, v.isActive
FROM Videos v
  JOIN 
(
  SELECT vt.VideoID, Count(*) as MatchCount
  FROM VideoTags vt
  WHERE vt.TagID in
  (
    SELECT TagID
    FROM Tags t
    WHERE t.Tag in ('horror', 'scifi')
  )
  GROUP BY vt.VideoID
) as sub
  ON v.VideoID = sub.VideoID
ORDER BY sub.MatchCount desc


And here's the LINQ:

List<string> TagList = new List<string>() { "horror", "scifi" };

  //find tag ids.
var tagQuery =
  from t in db.Tags
  where TagList.Contains(t.Tag)
  select t.TagID;

  //find matching video ids, count matches for each
var videoTagQuery =
  from vt in db.VideoTags
  where tagQuery.Contains(vt.TagID)
  group vt by vt.VideoID into g
  select new { VideoID = g.Key, matchCount = g.Count() };

  //fetch videos where matches were found
  //ordered by the number of matches, best match first
var videoQuery =
  from v in db.Videos
  join x in videoTagQuery on v.VideoID equals x.VideoID
  orderby x.matchCount descending
  select v;

  //hit the database and pull back the results
List<Video> result = videoQuery.ToList();


Oh wait - you don't have a tag list, you have a video and want videos with similar tags. OK:

SELECT v.VideoID, v.Title, v.isActive
FROM Videos v
  JOIN 
(
  SELECT vt.VideoID, Count(*) as MatchCount
  FROM VideoTags vt
  WHERE vt.TagID in
  (
    SELECT TagID
    FROM VideoTags vt2
    WHERE vt2.VideoID = @VideoID
  )
  GROUP BY vt.VideoID
) as sub
  ON v.VideoID = sub.VideoID
ORDER BY sub.MatchCount desc

And the LINQ is the same, except that the tag query changes:

long myVideoID = 4;  // VideoID is a bigint

  //find tag ids.
var tagQuery =
  from t in db.VideoTags
  where t.VideoID == myVideoID
  select t.TagID;
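
One thing to watch for in the @VideoID version of the SQL above: the source video shares all of its own tags with itself, so it will come back as the top match. Below is a minimal sketch of one way to leave it out, run against SQLite through Python's sqlite3 module as a stand-in for SQL Server 2005. The function name related_videos and the sample rows are hypothetical, and the isActive = 1 filter is an assumption about what a "related videos" list should show, not something the query above does.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server 2005 so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Videos    (VideoID INTEGER PRIMARY KEY, Title TEXT NOT NULL, isActive INTEGER NOT NULL);
CREATE TABLE VideoTags (VideoID INTEGER NOT NULL, TagID INTEGER NOT NULL,
                        PRIMARY KEY (VideoID, TagID));

-- Made-up sample data; tag ids: 1 horror, 2 scifi, 3 action, 4 family.
INSERT INTO Videos VALUES (1, 'Alien', 1), (2, 'Aliens', 1), (3, 'Jaws', 1),
                          (4, 'Up', 1), (5, 'Predator', 0);
INSERT INTO VideoTags VALUES (1, 1), (1, 2), (2, 1), (2, 2), (2, 3),
                             (3, 1), (4, 4), (5, 2), (5, 3);
""")

RELATED_SQL = """
SELECT v.VideoID, v.Title, sub.MatchCount
FROM Videos v
JOIN (
    SELECT vt.VideoID, COUNT(*) AS MatchCount
    FROM VideoTags vt
    WHERE vt.TagID IN (SELECT TagID FROM VideoTags WHERE VideoID = :vid)
      AND vt.VideoID <> :vid          -- skip the source video itself
    GROUP BY vt.VideoID
) AS sub ON v.VideoID = sub.VideoID
WHERE v.isActive = 1                  -- assumption: hide inactive videos
ORDER BY sub.MatchCount DESC, v.VideoID
"""


def related_videos(conn, video_id):
    """Return (VideoID, Title, MatchCount) rows for videos sharing tags."""
    return conn.execute(RELATED_SQL, {"vid": video_id}).fetchall()


for row in related_videos(conn, 1):
    print(row)
```

In T-SQL the same idea amounts to adding AND vt.VideoID <> @VideoID inside the subquery; a video with no tags in common never appears because the inner join drops it.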
David B
Thanks a Lot, it helped me a lot.
Marc V