views: 750
answers: 4

I have a database of 20 million users and the connections between those people. How can I prove the concept of "six degrees of separation" in the most efficient way programmatically?

link to the article about Six degrees of separation

+10  A: 

You just want to measure the diameter of the graph. That is exactly the metric for the separation between the most-distantly-connected nodes in a graph.

Lots of algorithms come up on Google; the Boost Graph Library has them too.

SPWorley
Is six degrees a maximum or an average? Most of the actual analysis I've read about uses the average, not the maximum.
Kathy Van Stone
The common conception of "six degrees of separation" is that it is a maximum. Of course, that isn't literally true; it's just more impressive to state it that way, and counterexamples are hard to find.
Tyler McHenry
+3  A: 

You can probably fit the graph in memory (as an adjacency list, where each vertex knows the list of its neighbors).

Then, from each vertex n, you can run a breadth-first search (using a queue) to a depth of 6 and count the number of vertices visited. If not all vertices are visited, you have disproved the theorem. Otherwise, continue with the next vertex n.

This is O(N*(N + #edges)) = N*(N + N*100) = 100N^2 if each user has 100 connections on average, which is not ideal for N = 20 million. I wonder if the mentioned libraries can compute the diameter with better time complexity (the general algorithm is O(N^3)).

The computations for individual vertices are independent, so they could be done in parallel.

A little heuristic: start with the vertices that have the lowest degree (they offer the best chance of disproving the theorem).
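The depth-limited BFS described above can be sketched as follows. This is a toy illustration, not the poster's code: it assumes the graph is held as a plain dict mapping each user ID to a list of neighbor IDs, and the function name `within_six_degrees` is made up here.

```python
from collections import deque

def within_six_degrees(graph, start):
    """Breadth-first search from `start`, cut off at depth 6.
    Returns True if every vertex in `graph` is reachable from
    `start` within six hops (the theorem holds for this vertex)."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == 6:
            continue  # do not expand beyond six hops
        for neighbor in graph[vertex]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return len(visited) == len(graph)

# Tiny example: a path 0-1-2-3-4-5-6 (the endpoints are exactly 6 hops apart)
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
print(within_six_degrees(path, 0))  # True: vertex 6 is reached at depth 6
```

Running this once per vertex (the outer loop from the answer) gives the full check, and since each run is independent the outer loop parallelizes trivially.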

Martin Konicek
I think this is considerably worse than O(n^2). Even assuming each node is connected to only 3 other nodes, a search of depth 6 would visit 3*2^0 + 3*2^1 + 3*2^2 + 3*2^3 + 3*2^4 + 3*2^5 nodes. Exponential growth.
patros
For each starting vertex you visit every other vertex at most once, so the run for one vertex takes O(N).
Martin Konicek
Ah, true, that is a limit on it. I think this is still O(N^3), then, isn't it? Finding a path from vertex A to vertex B is O(N), and you must do this O(N^2) times.
patros
Hi, imagine you want to compute the shortest distance from one vertex to every other vertex in the graph. Since the edges don't have weights (length of a path = number of edges), this can be done in O(N + #edges) with a single breadth-first search - http://en.wikipedia.org/wiki/Breadth-first_search. You are right that it's not O(N) but O(N + #edges), where #edges is about 100*N (if each user has 100 connections on average). Thanks
Martin Konicek
+2  A: 

I think the most efficient way (worst case) is almost N^3. Build an adjacency matrix, and then compute matrix^2, matrix^3, matrix^4, matrix^5 and matrix^6. Look for any entries that are 0 in every matrix from the original through matrix^6; those pairs are more than six degrees apart.

Heuristically, you can try to single out subgraphs (large clumps of people who are only connected to other clumps through a relatively small number of "bridge" nodes), but there's absolutely no guarantee you'll find any.
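The matrix-power idea above can be sketched in pure Python with boolean matrices. This is a toy-scale illustration only (the function names are made up here): each product is O(n^3) and the matrix takes O(n^2) memory, which is exactly why it can't scale to 20 million vertices.

```python
def bool_mat_mult(a, b):
    """Boolean matrix product: result[i][j] is True iff some k links i -> k -> j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def unreachable_within(adj, max_hops):
    """Pairs (i, j) with no path of length <= max_hops, found by OR-ing
    together boolean powers of the adjacency matrix up to max_hops."""
    n = len(adj)
    reach = [[i == j or bool(adj[i][j]) for j in range(n)] for i in range(n)]
    power = [[bool(x) for x in row] for row in adj]
    for _ in range(max_hops - 1):
        power = bool_mat_mult(power, adj)   # paths exactly one hop longer
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or power[i][j]
    return [(i, j) for i in range(n) for j in range(n) if not reach[i][j]]

# 4-cycle 0-1-2-3-0: opposite corners are exactly 2 hops apart
cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]
print(unreachable_within(cycle, 2))  # [] -- every pair within 2 hops
print(unreachable_within(cycle, 1))  # [(0, 2), (1, 3), (2, 0), (3, 1)]
```

For the six-degrees question, an empty result from `unreachable_within(adj, 6)` would confirm the theorem for the whole graph.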

patros
You can't build a 20-million-by-20-million adjacency matrix in memory. Also, the multiplication would be O(N^3), where N is 20 million.
Martin Konicek
It's roughly n^2.8 with Strassen's algorithm, since they're square matrices. You also don't need to keep the whole matrix in memory, only the parts you're actively multiplying. Page the rest out to disk.
patros
It does require a lot of disk, though... 400 TB for a naive approach (one byte per entry of a 20-million-squared matrix). Lots of room for compression, though.
patros
+2  A: 

Well, a better answer has already been given, but off the top of my head I would have gone with the Floyd-Warshall all-pairs shortest path algorithm, which is O(n^3). I'm unsure of the complexity of the graph-diameter algorithm, but it "sounds" like it would also be O(n^3). I'd appreciate clarification on this if anyone knows.
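For reference, Floyd-Warshall on an unweighted, undirected graph looks like this (a toy sketch, not something to run at n = 20 million, where the O(n^2) distance table alone is far too large):

```python
def floyd_warshall(n, edges):
    """All-pairs shortest paths on an unweighted, undirected graph
    with vertices 0..n-1. O(n^3) time and O(n^2) memory."""
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1  # every edge has length 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# The diameter is the largest pairwise distance; for a path 0-1-2-3 it is 3
d = floyd_warshall(4, [(0, 1), (1, 2), (2, 3)])
print(max(max(row) for row in d))  # 3: the two endpoints of the path
```

"Six degrees" would then amount to checking that no entry of the table exceeds 6.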

On a side note, do you really have such a database? Scary.

Dan Olson