views:

85

answers:

2

Hi,

I don't know whether the term sort is most appropriate but anyway I have a large datasets which consists of columns userA, userB, date, interactionDuration. In other words the dataset contains rows which describe which users were interacting, how long the interaction lasted and the date of interaction.

The goal is to divide these interaction into relevant and irrelevant or important and unimportant and by the following logic: if the interaction is long we assume it is important and if it is short it is accidental and "irrelevant". On the other hand if same two persons interact often (every day or every week) and event if the single interaction doesn't last long we assume that it is important because it is not accidental.

The data is stored in MS SQL Server 2008 and I am wondering whether it is possible to somehow calculate the interval (in days) between the interactions of distinct pairs of people by using SQL queries or it is necessary to write custom routine.

Anyway I would be very thankful to hear any suggestion about how to perform sorting that divides the pairs into groups by the number of interactions, interaction durations and the intervals between the interactions - is it possible to calculate timespans between two consecutive records (sorted by date) and then select the group of pairs which interacted commonly and/or for a long time by using solely SQL queries or is it neccesary to write custom algorithm?

edit1

These are example rows:

user_A                                      user_B                                      interaction_duration    interaction_date_date
00002781A2ADA816CDB0D138146BD63323CCDAB2    0E3F0A58C6BD5202D1B0D4D4F6A0B05E7A156AC7    532                     2010-09-04
00002781A2ADA816CDB0D138146BD63323CCDAB2    47DFCA60C7D2D9B32ACBE650BCBA53F3DA57CB7B    29                      2010-09-15
00002781A2ADA816CDB0D138146BD63323CCDAB2    8F4ADF3CC26905AF69F930C546954D87DB536E24    83                      2010-09-04
00002781A2ADA816CDB0D138146BD63323CCDAB2    8F4ADF3CC26905AF69F930C546954D87DB536E24    10                      2010-09-01
00002D2354C7080C0868CB0E18C46157CA9F0FD4    372BD190F61BAACDFA2512B71B0C62BCD654D5D1    50                      2010-09-23
00002D2354C7080C0868CB0E18C46157CA9F0FD4    372BD190F61BAACDFA2512B71B0C62BCD654D5D1    105                     2010-09-18
00002D2354C7080C0868CB0E18C46157CA9F0FD4    372BD190F61BAACDFA2512B71B0C62BCD654D5D1    44                      2010-09-08
00002D2354C7080C0868CB0E18C46157CA9F0FD4    372BD190F61BAACDFA2512B71B0C62BCD654D5D1    50                      2010-09-04
00002D2354C7080C0868CB0E18C46157CA9F0FD4    372BD190F61BAACDFA2512B71B0C62BCD654D5D1    21                      2010-09-02

What is the best way to sort the above records by frequency (rarity) of interactions? Is it possible to sort it by using sql queries?

Thank you!

A: 

I suggest using analytical functions within a sub-query to determine date intervals, then sort by the analytical function results in the outer query.

Mark Bannister
+1  A: 

If I understand you correctly, given this data (first 10 rows):

User1   User2   Stamp   Duration (secs)
d   g   09/01/2010  20
b   g   13/01/2010  37
c   k   29/01/2010  15
c   k   12/01/2010  39
e   h   27/01/2010  33
b   i   23/01/2010  46
b   g   04/01/2010  44
b   k   05/01/2010  12
b   k   02/01/2010  9

You would like to know which users interacted often or for a long time in a given period. I assume you're using MS SQL Server 2005 or later.

More than 10 conversations in a period ranging from 3 to 7 days:

WITH Contact(u1, u2, dt) AS (
    SELECT DISTINCT
        user1, user2, stamp
    FROM
        Interactions
)
SELECT
    u1, u2, dt as FirstConversation, MAX(stamp) as LastConversation,  
    COUNT(Duration) AS conversations
FROM
    Interactions INNER JOIN Contact
    ON user1 = u1 AND User2 = u2 AND Stamp >= dt
GROUP BY
    u1, u2, dt
HAVING
    COUNT(duration) > 10 AND 
    DATEDIFF("d", MIN(stamp), MAX(stamp)) BETWEEN 3 and 7
ORDER BY
    1, 2, 3

The WITH statement calculates, for each pair of users, the dates of their interactions, ignoring multiple interactions during the day. This is our baseline for finding conversations AFTER that date which meet the conditions we set. This is expressed by the "Stamp >= dt", which means "show the contacts which have interactions after the contact".

More than 2 minutes in 4 days:

WITH Contact(u1, u2, dt)AS (
    SELECT DISTINCT
        user1, user2, stamp
    FROM
        Interactions
)
SELECT
    u1, u2, dt as FirstConversation, MAX(stamp) as LastConversation,  
    SUM(Duration) AS timespent
FROM
    Interactions INNER JOIN Contact
    ON user1 = u1 AND User2 = u2 AND Stamp >= dt
GROUP BY
    u1, u2, dt
HAVING
    SUM(Duration) > 120 AND 
    DATEDIFF("d", MIN(stamp), MAX(stamp)) BETWEEN 0 and 4
ORDER BY
    1, 2, 3

HTH?

smirkingman