tags:

views:

17

answers:

1

I am trying to build a recommender system based on user clicks. so far, i am not very concerned with all the nodes. i just wanted a way to develop a mechanism to detect that when a user clicked on A, he clicked on B. and when the user clicks on B, he clicks on C. so basically, in this node:

a > b > c

Its not very important for me to be able to tell that clickers on A will most likely click on c. as long as I know that clicker on A clicked on B, and clicker on B clicked on C, and so on, I am jolly.

On the preface, we collect logs in the form of IIS which has viewstateuserkey (guid) as unique identifier. So basically, if 2 viewstateusers click on a and b, the result should return 2.

Here is what my table looks like currently.

CREATE TABLE [dbo].[ratingdeveloper]( [itemid1] [int] NULL, [itemid2] [int] NULL, [count] [int] NULL, [sum] [int] NULL, [userid] varchar NULL ) ON [PRIMARY]

GO

and a sample query looks like this:

SET NOCOUNT ON

SELECT a.itemid1 , b.itemid2 , sqrt(count(a.userid)) as occurance from ratingdeveloper as a WITH (NOLOCK) inner join ratingdeveloper as b WITH (NOLOCK) on a.itemid1 = b.itemid1 and a.itemid2 = b.itemid2 group by a.itemid1 , b.itemid2
HAVING a.itemid1 = 1 order by a.itemid1 , b.itemid2 GO

My Questions are:

1) Does this look good enough? Any better ideas? 2) Since IIS logs have viewstateuser keys, and click times, how do I loop through the log file, and import every instance of clicks on respective places. For instance, if viewstateuser(1) clicks on A, B, C

I want the data inserted in the form of:

ITEMID1 = A, ItemID2 = B ITEMID1 = B, ITEMID2 = c

and so on.

Any suggestion would be appreciated.

Thanks,

-Subhash

A: 

It's going to be difficult to give you a good answer without knowing your performance constraints. However, I think you might be better served by having a table to keep track of the raw clicks; I'm not really sure how you're planning to use the [sum] and [count] fields.

I might go with something like:

CREATE TABLE [dbo].[ratingdeveloper] (
    ItemId1 INT NOT NULL FOREIGN KEY REFERENCES Items (ItemId), -- (why would you ever want a null?)
    ItemId2 INT NOT NULL FOREIGN KEY REFERENCES Items (ItemId),
    UserId UNIQUEIDENTIFIER NOT NULL)

You may want indexes on ItemId1 and on UserId, depending on your performance constraints.

I'm not sure what you are trying to accomplish by taking the square root in your query, either. If you are trying to make clicks from the same user "worth" less than clicks from users in distinct sessions (because you actually don't know in this model whether different sessions are different users), then you might try something like:

SELECT a.ItemId2, SUM(SQRT(a.Item2Count)) as rating
FROM
(SELECT rd.UserId, rd.ItemId2, COUNT(rd.ItemId2) AS Item2Count
 FROM   ratingdeveloper rd
 WHERE  rd.ItemId1 = @ItemId1
 GROUP BY rd.UserId, rd.ItemId2) a
GROUP BY a.ItemId2
ORDER BY rating DESC

Of course, @ItemId1 is the starting item, and you'll want that to be a parameter in a stored procedure if you're just using plain old SQL (without something like Linq or Entity Framework).

As far as importing the info from the log, you'll want to parse out the user id, referer, and new URL per entry, match those up with the items in the database, and insert the information into the ratingdeveloper table. I don't know what language you would be wanting to use for that or generally how you would want to run such an application. You could, alternatively, keep track of the stats "your way" in real time as the user clicks on links in the site. Your web framework will make a difference in the best way to go about this, but it could be an override of the Page_Init in ASP.NET, for example, that inserts the stats into the table.

Andrew