ansaurus

Question

How do sites like LinkedIn efficiently display 1st/2nd/3rd-level relationship next to each person's name?

Answer 1

+1 A:

If you think about it, doing this in SQL could be very processor intensive.

Given that and the fact that it will ultimately be used all over the place, and that space is relatively cheap...I would suggest creating an index using Lucene (or Lucene.NET) depending on your language preference. You could do a couple things this way.

You can either create a tree type data structure and recursively crawl your index looking for all the parent nodes or child nodes and their parent or child nodes depending on your needs at the time.

Or you could write out all the relationships as they are created (the space is cheap concept). This would be a write once process (which you wouldn't be updating all that often any ways). When a relationship is created or revoked you would queue an update to your index (queue because you wouldn't want to open for write for single requests...batch the index updates). Then you could read this really flat structure to get the IDs in question.

With the IDs in hand (from which ever search type you perform) you can then go to the DB to get the surrounding required information. Then cache your output to further minimize what would be a very fast search, db query, data building...but faster still if it just comes from cache.

Use something like Velocity, MemCached, or MemCached Win32 for your centralized caching across a web farm.

Andrew Siemer 2009-10-12 19:54:07

Hmmm. My thought was that it'd be smarter to separate search (which can act the same for all users) from finding relationship info (which is always relative to me). So you'd execute a search, get back the list of IDs in that search, and then attach relationship distance. Was that what you're suggesting too, or something else? Also, although memcached (or velocity, or...) would work for per-user caches (e.g. cache 2nd-level connections after login), if I wanted to cache the entire network, I'd need to fetch many (e.g. 1000) records in a batch with 1 remote call. Is memcached good at that?

Justin Grant 2009-10-13 18:25:35

Answer 2

+4 A:

Interestingly, 1970's technology would do a fair job of modeling this. The Network Database Model efficiently manages this type of relationship.

It's not efficient in terms of ad hoc queries or data model maintenance, so fell out of favor with the rise of relational data models.

Eric J. 2009-10-12 20:01:50

Answer 3

+2 A:

You may be able to leverage axioms about small world networks to optimize this type of traversal.

Small world networks are characterized by "hubs" the represent very dense interconnections of other nodes. Most nodes in the network will generally either connect within a few hops to a topologically nearby node (1-4 hops away) or will route through one or more such hubs. This is one of the main reasons that small world networks behave the way they do.

LBushkin 2009-10-12 20:09:41

Yep, LinkedIn (and any social networking site) definitely has networks which act like this. How do you think I could apply these affects to create a solution that's better (easier to build and/or faster to execute) than the simplistic "persistenly cache everyone's first-level connections, and cache second-level connections on login" solution I came up with above?

Justin Grant 2009-10-13 18:18:46

Conceptually, you're trying to find the shortest path between two nodes in a small world network (SWN). A naive algorithm starts at one of the target nodes and expands outwards looking at children, grandchildren, great-grandchildren, until you find the other target. Computationally it's O(c^n) where c = average number of children for a node. Using the properties of SWN, first identify which nodes are hubs, identify the number of hops between all hubs n*(n-1), order the hubs by size, and then first look for both target nodes as children of this set of hubs.

LBushkin 2009-10-13 18:46:33

If none of the hubs contains a child, you would expand outward looking at the immediate children of a hub for either target node. Such an algorithm may not find the *shortest* distance between two nodes, but it will find a relatively short one if it exists through a hub. Furthermore, you can build some caching structures around the hubs in such a network to optimize lookups for immediate children and grandchildren.

LBushkin 2009-10-13 18:48:15

Very interesting, and nice computational details. In LinkedIn's case, they show all matching paths of a given length, not just the first one encountered at a particular depth-- so I think your traversal algorithm won't help to reduce total time/CPU needed. But I like your idea about using "hubness" (or even using # of connections as a proxy) as a way to prioritize what to cache, instead of simply using LRU caching. For this useful insight, I'll accept your answer!

Justin Grant 2009-10-16 17:53:07

Answer 4

A:

I'm not sure of the table structure, or complexity of the system, but here is a simple SQL Server example using a recursive CTE:

DECLARE @People table (PersonID int, Name varchar(10))
DECLARE @Network table (PersonID int, NetworkedPersonID int)
INSERT INTO @People VALUES (1,'AAA')
INSERT INTO @People VALUES (2,'BBB')
INSERT INTO @People VALUES (3,'CCC')
INSERT INTO @People VALUES (4,'DDD')
INSERT INTO @People VALUES (5,'EEE')
INSERT INTO @People VALUES (6,'FFF')
INSERT INTO @People VALUES (7,'GGG')
INSERT INTO @People VALUES (8,'HHH')
INSERT INTO @Network VALUES (1,2)
INSERT INTO @Network VALUES (1,3)
INSERT INTO @Network VALUES (2,5)
INSERT INTO @Network VALUES (2,7)
INSERT INTO @Network VALUES (4,8)
INSERT INTO @Network VALUES (7,8)
INSERT INTO @Network VALUES (7,3)
INSERT INTO @Network VALUES (8,9)
DECLARE @TargetPersonID  int
SET @TargetPersonID=1

;WITH NetworkLevels AS
(   SELECT
        NetworkedPersonID,1 AS NetworkLevel
        FROM @Network
        WHERE PersonID=@TargetPersonID
    UNION ALL
    SELECT
        n.NetworkedPersonID, l.NetworkLevel+1
        FROM @Network                n
            INNER JOIN NetworkLevels l ON n.PersonID=l.NetworkedPersonID
    WHERE l.NetworkLevel<=2
)
SELECT * FROM NetworkLevels

OUTPUT:

NetworkedPersonID NetworkLevel
----------------- ------------
2                 1
3                 1
5                 2
7                 2
8                 3
3                 3

(6 row(s) affected)

KM 2009-10-12 20:20:58

This solution will almost certainly be I/O bound at real-world scale (e.g. 50M members, 100 connections per user, 50+ queries/second). If you only go 2 levels deep, that's 10K I/Os per second, if you go 3 levels deep that's 1M I/O's per second. Relational databases (and especially those backed by non-SSD disk-based storage) aren't well suited to handle I/O rates that high. A custom solution which cached the connections in RAM could probably get 100x the throughput and use half the RAM, which (at the scale of a service like linkedin) would probably save $1M+ in hardware.

Justin Grant 2009-10-12 23:02:40

@Justin Grant, gee you don't think my 15 line query would replace the custom solution developed (with considerable effort over several years, and with some serious cash no doubt) by linkedin! I don't either, it was just a boring day ;-) I thought this post might help someone if they were doing a much lower volume application of the 3 level relationship. perhaps in a week, six month, or a year, someone will google this page and this simple code might help them. Not every web app has a completely custom data storage system, the available resources, or the high volume of linkedin.

KM 2009-10-13 13:38:21

agreed. my fault for not being clear enough that the essence of the question was not solving the problem (which your solution certainly does!) but solving the problem while dealing with the perf/scale/cost tradeoffs that accompany very large (10M-100M+) social networks.

Justin Grant 2009-10-13 18:28:52

ansaurus

tags:

views:

answers:

How do sites like LinkedIn efficiently display 1st/2nd/3rd-level relationship next to each person's name?

related questions