views:

1377

answers:

11

UserA-UserB-UserC-UserD-UserF

Users connected by '-' know each other.

And I need an algorithm for these 2 tasks:

  1. Calculate the path from UserX to UserY
  2. For UserX,calculate all users that is no more than 3 steps away.

Is there an efficient solution?

EDIT

My purpose is not to prove it right or wrong,but to calculate the result real time when necessary.

Plus,I think the most expressive way is code,even pseudo ones.

EDIT AGAIN

I've decided that this kind of job must be done inside database,so it must be a sql solution!

+1  A: 

Google it and you'll find a lot.

I doubt you can find a pseudo code (at least I've yet to). Here are some of the interesting reads:

"Six Degrees of Separation" Explained In New Computer Algorithm
CU computer scientist helps explain how 'six degrees of separation' works
SO - How can I prove the “Six Degrees of Separation” concept programmatically?

o.k.w
+7  A: 

Graph algorithms can help you here. Learning about them is fun too!

  • Graph connectivity for the connectivity.
  • Dijkstra (A*) for paths between users
  • Simple DFS for finding all users N nodes away from your user

Dijkstra or similar should be used if you want to find the shortest path between two users. There may be several paths between them, and Dijkstra takes care of noting when it found a shorted path than another that was found before.

To find the shortest paths between all users you'll have to use something like Floyd-Warshall. It is a nice example of dynamic programming and is quite simple to implement. The pseudocode from Wikipedia is:

 1 /* Assume a function edgeCost(i,j) which returns the cost of the edge from i to j
 2    (infinity if there is none).
 3    Also assume that n is the number of vertices and edgeCost(i,i) = 0
 4 */
 5
 6 int path[][];
 7 /* A 2-dimensional matrix. At each step in the algorithm, path[i][j] is the shortest path
 8    from i to j using intermediate vertices (1..k−1).  Each path[i][j] is initialized to
 9    edgeCost(i,j) or infinity if there is no edge between i and j.
10 */
11
12 procedure FloydWarshall ()
13    for k := 1 to n
14       for i := 1 to n
15          for j := 1 to n
16             path[i][j] = min ( path[i][j], path[i][k]+path[k][j] );
Eli Bendersky
Master,can you elaborate it with some simple enough code?
without knowing your background it's hard to tell
Eli Bendersky
Dijkstra's algorithm boils down to breadth-first search in this case, right? I think it'll be slow if you have lots of users.
Jason Orendorff
Yes,Dijkstra's algorithm applies for weighted diagram
To use A*, you have to have a heuristic of some sort, which would be extremely interesting.
Travis Gockel
That would be interesting - it would have to take the form of a figure indicating how likely it is for one member to know another member. Depending on the data you have available it is certainly possible - Facebook does this to suggest friends you might not have known were members.
Erik Forbes
+2  A: 

I had a look at this a while ago and could not come up with an efficient solution for a web application.

I ended up with 5 levels rather than six

See here for my google group post there is a SQL and a C# solution.

Note: that you should google for "Dijkstra's algorithm" as it is known to be a good algorithm to find a shortest path.

Edit: Try this link

BTW The CLR method executed the fastest.

Rippo
Yes,I've heard about that algorithm.But seems it's not fit here.Because it only finds **one** shortest path.
Why does it not fit? This finds ALL paths then filters the shortest. It can be adapted for your needs quite simply
Rippo
I can't click through that link..
Rippo
Sorry,it reports "the connection was reset.."
Rippo
Its the post posted on Dec 2, 2006
Rippo
Can you paste the sql solution here?I still can't click through to it.
Email me at info at rippo dot co dot uk and I will post you links.
Rippo
+2  A: 

The first question can be solved using dijkstra's algorithm. The second on using a DFS algorithm. This has already been said by the other guys, just wanted to point out, that the most efficient solution for both problems is not available in one algorithm.

pseudocode can be found at:

[Wikipedia][1]

for dijkstra and one in python for DFS at:

http://en.wikipedia.org/wiki/Depth-first_search

ruedi
+2  A: 

For task 2, you're not going to do better than breadth-first search, except maybe by caching.

For task 1, apply your solution for task 2. Find all users no more than 3 hops away from user X. Of course if user Y is in that set, you're done. If not, do a breadth-first search starting from user Y, and stop as soon as you reach any user that you already know is reachable from X.

(If you cache a little information about how you reached each user during task 2, then it will be easy to reconstruct the exact path when you find a link in task 1.)

Jason Orendorff
You can get a little smarter for task 1: put X and Y in separate hash sets; take turns expanding X's network and Y's network until you find a user they have in common.
Jason Orendorff
+3  A: 

The following scripts are written in sybase sql. You might have to go through minor modifications according to your db server.

Problem 1.

create table #connections (
    my_user  varchar(10)  not null  ,
    knows varchar(10)  not null  ,
        CONSTRAINT connection_pk PRIMARY KEY CLUSTERED ( my_user, knows)   
) 

create table #traversed (id varchar(10) primary key)

insert into #connections VALUES ('UserA','UserB')
insert into #connections VALUES ('UserB','UserA')
insert into #connections VALUES ('UserB','UserC')
insert into #connections VALUES ('UserC','UserB')
insert into #connections VALUES ('UserC','UserD')
insert into #connections VALUES ('UserD','UserC')
insert into #connections VALUES ('UserD','UserF')
insert into #connections VALUES ('UserF','UserD')

DECLARE @str_sql   varchar(200)               
DECLARE @str_order varchar(60)

declare @start varchar(10)
set @start = ('UserD')
declare @end varchar(10)
set @end = ('UserA')

if (@start >= @end)
    set @str_order = " order by id desc"
else
    set @str_order = " order by id asc"


INSERT INTO #traversed VALUES (@start)

WHILE (select count(*) from #traversed where id = @end) = 0    
BEGIN     
  INSERT INTO #traversed (id)    
  SELECT DISTINCT knows  
  FROM #connections e JOIN #traversed p ON p.id = e.my_user  
  WHERE e.knows NOT IN (SELECT id FROM #traversed)     
  AND e.knows between (select case when @start < @end then @start else @end end)  
      and (select case when @start < @end then @end  else @start end) 
END

set @str_sql = "SELECT #traversed.id FROM #traversed" + @str_order 
exec (@str_sql)

Problem 2.

create table #connections (
    my_user  varchar(10)  not null  ,
    knows varchar(10)  not null  ,
        CONSTRAINT connection_pk PRIMARY KEY CLUSTERED ( my_user, knows)   
) 

create table #traversed (id varchar(10) primary key)

insert into #connections VALUES ('UserA','UserB')
insert into #connections VALUES ('UserB','UserA')
insert into #connections VALUES ('UserB','UserC')
insert into #connections VALUES ('UserC','UserB')
insert into #connections VALUES ('UserC','UserD')
insert into #connections VALUES ('UserD','UserC')
insert into #connections VALUES ('UserD','UserF')
insert into #connections VALUES ('UserF','UserD')

declare @start varchar(10)
set @start = ('UserB')

declare @higher_counter int
declare @lower_counter int

set @higher_counter = 0
set @lower_counter = 0

INSERT INTO #traversed VALUES (@start)

WHILE (@higher_counter < 3)
BEGIN     
  INSERT INTO #traversed (id)    
  SELECT DISTINCT knows  
  FROM #connections e JOIN #traversed p ON p.id = e.my_user  
  WHERE e.knows NOT IN (SELECT id FROM #traversed)     
  AND e.knows > @start 

  set @higher_counter = @higher_counter +1
END  

WHILE (@lower_counter < 3)
BEGIN     
  INSERT INTO #traversed (id)    
  SELECT DISTINCT knows  
  FROM #connections e JOIN #traversed p ON p.id = e.my_user  
  WHERE e.knows NOT IN (SELECT id FROM #traversed)     
  AND e.knows < @start 

  set @lower_counter = @lower_counter +1
END   

SELECT #traversed.id FROM #traversed
gd047
+7  A: 

Represent this list of users by a graph

  • Each user is a node
  • There is an edge between any two users who know each other
  1. Calculate the path from UserX to UserY
  2. For UserX,calculate all users that is no more than 3 steps away.

These questions are so closely related that the same algorithm actually solves both of them. You can call it "Dijkstra's algorithm with all edges having weight 1," or "breadth-first search."

Essentially, starting at the first node, visit all its relatives; then mark them all as visited, record the shortest path to each (the shortest path to them + the edge you just traversed), and repeat for each of them. Stop after you've reached your destination for Problem #1, stop after the shortest path is > 3 for Problem #2.

This will run in O(n) time. No, there is no faster way.

The fastest O(n) algorithm for six-degrees of separation would probably be finding the sets of all users 1-step away from UserX and UserY, and finding the intersection of those two sets. If none, then add the users 2-steps from UserX and intersect; then add users 2-steps from UserY and intersect; etc. up to 3.

If each person has an average of 100 friends, this could require looking at up to about 2,020,200 users, as opposed to the 1,010 billion for Dijkstra's algorithm. In practice, these numbers would be much smaller, since often two of your friends are also friends with each other.

This is the only method of solving six-degrees of separation (of those mentioned so far) that will work in practice.

BlueRaja - Danny Pflughoeft
What do you think of my current solution for this problem:http://stackoverflow.com/questions/2101718/time-complexity-mysql-performance-analysis
@unknown: See the comments on that thread.
BlueRaja - Danny Pflughoeft
+3  A: 

I have a suggestion which is quite different from those you already have. If you have to stick with an SQL database and you don't know any java this suggestion won't be of much use.

You problem is specifically a graph problem, so I would suggest that while using an SQL database to store the graph will work, a different approach would be to use a solution that is intended specifically for graph problems.

The Neo4j project provides a disk-based graph database, together will lots of graph algorithms to work with it. Quote:

Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.

An appropriate example of using Neo4j on their wiki demonstrates a degrees-of-separation web application using IMDB data. The example illustrates shortest-path calculations between any actor and Kevin Bacon.

I like this example since it talks a lot about modeling the domain your graph will represent. Modeling your domain ensures you have thought about things like:

  1. Directed vs Undirected
  2. Edge types/relationships
  3. Attributes such as edge weights

As has been mentioned in other posts, there are a number of algorithms for calculating shortest-paths, such as Dijkstra, Floyd Warshall or BFS. All these have been implemented in Neo4j and some examples are provided here.

Binary Nerd
+3  A: 

Assuming source data is in a table: Connections:(PersonID, KnowsPersonID)

1) This will have to use a breadth first approach. Your potential for good performance is limited due to the exponential nature of the problem (although this exponential nature is why theoretically you only need 6 degrees :D). Make sure you limit the depth of your search. Whichever flavour of SQL you choose you'll probably be better off using its iterative extensions as opposed to a pure set based solution.

The following would be the basic approach using Microsoft's T-SQL:

CREATE PROCEDURE FindPath (@UserX int, @UserY int)

CREATE TABLE #SixDegrees(
  ConnectedUser int,
  Depth int,
  Path varchar(100),
  CONSTRAINT PK_SixDegrees PRIMARY KEY CLUSTERED (
    ConnectedUser
  )
)

DECLARE @Depth int,
        @PathFound varchar(100)
SET @Depth = 0

INSERT INTO #SixDegrees (@UserX, 0, CAST(@UserX as varchar))
/*Next line just in case X & Y are the same*/
SET @PathFound = (SELECT Path 
                  FROM #SixDegrees 
                  WHERE ConnectedUser = @UserY)

WHILE @Depth < 6 AND @PathFound IS NULL
BEGIN
  SET @Depth = @Depth + 1
  INSERT INTO #SixDegrees
  SELECT  k.KnowsPersonID, @Depth, (SELECT Path 
                                    FROM #SixDegrees 
                                    WHERE ConnectedUser = k.Link) + ',' + CAST(k.KnowsPersonID AS varchar)
  FROM (
      SELECT  MIN(ConnectedUser) Link, KnowsPersonID
      FROM    #SixDegrees
              JOIN Connections ON
                PersonID = ConnectedUser
      WHERE   Depth = @Depth
              /*EDIT: Added the following*/
              AND KnowsPersonID NOT IN (
                  SELECT  ConnectedUser
                  FROM    #SixDegrees
                  )
      GROUP BY KnowsPersonID
      ) k

  SET @PathFound = (SELECT Path 
                    FROM #SixDegrees 
                    WHERE ConnectedUser = @UserY)
END

IF @Path IS NULL
  PRINT 'No path found'
ELSE
  PRINT @Path
GO

EDIT: In the above solution I originally forgot to exclude users already in the #SixDegrees temp table.

2) A little tweak on the above to always loop to a depth of 3 would leave you with #SixDegrees containing all the Users you're interested in.

However, the following pure set based solution should be more efficient:

SELECT  DISTINCT KnowsPersonID
FROM    Connections
WHERE   PersonID IN (
    SELECT  DISTINCT KnowsPersonID
    FROM    Connections
    WHERE   PersonID IN (
        SELECT  KnowsPersonID
        FROM    Connections
        WHERE   PersonID = @UserX
        ) l1
    ) l2
Craig Young
+2  A: 

(This answer is equivalent to Djikstra's. It is basically an implementation detail.)

To answer #2, you can use boolean matrix multiplication to determine connectivity up to degree P.

Assuming you have a boolean matrix M where:

M(A, B)= A is directly connected to B

Then

(M(A, B))^P= A is connected to B within P links.

The matrix multiplication should use AND for multiply and OR for add:

You can optimize this a lot by only doing the multiplication for entries that were previously false as well as realizing that the matrix is symmetric. That is left as an exercise to the reader.

MSN
+1  A: 

I'm responding to the SQL solution only. This gives all the paths 3 steps away though it may not be "efficient" for large datasets. The tables "KNOW", "KNOW_1", etc. are all identical and have two fields P1 and P2. It has an entry only if 1) P1 knows P2 or 2) P2 knows P1. The data in P1 and P2 can be arbitrary strings corresponding to each person.

This Acccess SQL query should yield all the paths where a knows b knows c knows d without cycles (e.g a knows b knows c knows a). You still need to eliminate duplicates (abcd = dcba), but should be able to do it easily in a second step. The cycles are eliminated by preventing repeats of previous people in the Where statements.

SELECT Know.P1, Know.P2, Know_1.P2, Know_2.P2

FROM (Know INNER JOIN Know AS Know_1 ON Know.P2 = Know_1.P1)

INNER JOIN Know AS Know_2 ON Know_1.P2 = Know_2.P1

WHERE (((Know_1.P2)<>[Know].[P1]) AND ((Know_2.P2)<>[Know].[P1] And (Know_2.P2)<>[Know].[P2]))

ORDER BY Know.P1, Know.P2, Know_1.P2, Know_2.P2;

Not as elegant as the previous solutions, but it seems to work OK. We've had some experience with doing similar work with constraint programming and found that the SQL process is faster for certain problems.

Grembo