views:

933

answers:

4

I recently botched a job interview by poorly answering a straightforward question: how do sites like LinkedIn efficiently show the relationship distance (1st/2nd/3rd) from you to every person displayed on a page (e.g. in people search results, list of people working in a company, etc.)?

<EDIT> I did get the essential "trick" of the solution: finding "distance from me" is a common operation (e.g. 20+ times on a single page, hundreds per login session), so you can compute part of the "distance from me to X" calculation once, cache that partial result, and then re-use it many times to make the other operations much cheaper. I also guessed that the partial result was likely to be my second-level connections, because caching all 3rd-level connections would be too costly in RAM and CPU.</EDIT>

But when trying to convert this insight into a solution, I came up with a bumbling answer involving persistent caches of the 2nd-level connections of everyone on the site (which would have been hugely expensive in perf and complex to maintain), and I took an inexplicable detour into using Bloom filters in a way that made little technical sense. I wouldn't have hired myself after an answer like that!

Later, as I thought about the problem without the pressure of an interview hanging over my head, I came up with a more reasonable answer.

  • Build a very fast way to get the first-level connections for a batch of user IDs (batch size up to ~1000?). This probably means a dedicated cluster of lots-of-RAM servers which can cache the entire network's 1st-level connections in memory. Luckily, 50M members x avg. 100 connections per member x 4 bytes per member ID = 20GB, which is doable to cache in RAM on reasonably-priced hardware. And the number of changes per day is going to be under 1%, so keeping the cache up-to-date is not too hard. (Note that a relational database would probably be a bad choice to implement this cache, because the "lots of random I/O" access pattern kills relational DB performance.)

  • When a user logs in, cache his 2nd-level connections by fetching the 1st-level connections of each of his 1st-level connections, and stick them in a hashtable (key = 2nd-level ID, value = array of the 1st-level connections which connect you). Cache your first-level connections too, so you can pull back both 1st- and 2nd-level via a single call to your remote cache server. User IDs are easily partitionable, so a distributed cache like memcached may work well for this.

  • For any user ID, to find whether it's in your "network" and what relationship it is to you (1st, 2nd, 3rd), do the following:

    1. if the ID is in your first-level connections, stop.
    2. try looking up the ID in your cached 2nd-level connections hashtable. If found, return the array of connections which connect you.
    3. fetch the ID's first level connections, and repeat step #2 for each of them. Aggregate all results into a single array and return them.
    4. <EDIT> refactor into a batch implementation ("look up distance from me to N different users") so you can get all the remote results from step #3 without having to make up to N remote calls.</EDIT>
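The steps above can be sketched in Python. Everything here is illustrative: the in-process `FIRST_LEVEL` map stands in for the dedicated connection-cache cluster, and in production `get_first_level` would be one batched remote call rather than a dict lookup.

```python
from collections import defaultdict

# Stand-in for the remote 1st-level connection cache (illustrative data).
FIRST_LEVEL = {
    1: {2, 3},
    2: {1, 5, 7},
    3: {1, 7},
    5: {2},
    7: {2, 3, 8},
    8: {7},
}

def get_first_level(batch_of_ids):
    """Batched fetch of 1st-level connections (one remote call in production)."""
    return {uid: FIRST_LEVEL.get(uid, set()) for uid in batch_of_ids}

def build_session_cache(me):
    """At login: cache my 1st-level set, plus a map of
    2nd-level ID -> list of the 1st-level connections which connect us."""
    first = FIRST_LEVEL.get(me, set())
    second = defaultdict(list)
    for friend, their_friends in get_first_level(first).items():
        for uid in their_friends:
            if uid != me and uid not in first:
                second[uid].append(friend)
    return first, dict(second)

def distance(me, target, first, second):
    """Return (level, connecting IDs), or None if beyond the 3rd level."""
    if target in first:                      # step 1
        return (1, [target])
    if target in second:                     # step 2
        return (2, second[target])
    # step 3: target's 1st-level connections that sit in my 2nd level
    vias = [uid for uid in FIRST_LEVEL.get(target, set()) if uid in second]
    return (3, vias) if vias else None
```

Step #4's batching falls out naturally: since `get_first_level` accepts a whole batch, looking up N people on a page needs only one extra remote round-trip for all the step-#3 fetches combined.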

But I'm sure there are better answers to this. What's yours? If you want an extra challenge, try simulating an interview situation (no looking up solutions on the Web).

Note that the question was about an optimal solution, regardless of how LinkedIn actually does it today, which I looked up after I wrote my own answer above.

+1  A: 

If you think about it, doing this in SQL could be very processor intensive.

Given that, and the fact that it will ultimately be used all over the place, and that space is relatively cheap... I would suggest creating an index using Lucene (or Lucene.NET, depending on your language preference). You could do a couple of things this way.

You can either create a tree-type data structure and recursively crawl your index, looking for all the parent or child nodes and their parents or children, depending on your needs at the time.

Or you could write out all the relationships as they are created (the "space is cheap" concept). This would be a write-once process (and you wouldn't be updating it all that often anyway). When a relationship is created or revoked, you would queue an update to your index (queue because you wouldn't want to open the index for writing on single requests... batch the index updates). Then you could read this really flat structure to get the IDs in question.
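The "write it out once, batch the updates" idea might be sketched like this. This is a toy in-memory stand-in for the real index and queue; all names are hypothetical.

```python
from collections import deque

flat_index = {}    # user ID -> set of directly connected user IDs
pending = deque()  # queued relationship changes awaiting a batch flush

def enqueue(action, a, b):
    """Record a relationship change without touching the index yet."""
    pending.append((action, a, b))

def flush():
    """Apply all queued creates/revokes in one batched write pass."""
    while pending:
        action, a, b = pending.popleft()
        if action == "create":
            flat_index.setdefault(a, set()).add(b)
            flat_index.setdefault(b, set()).add(a)
        else:  # "revoke"
            flat_index.get(a, set()).discard(b)
            flat_index.get(b, set()).discard(a)

enqueue("create", 1, 2)
enqueue("create", 1, 3)
enqueue("revoke", 1, 3)
flush()
# flat_index[1] now holds {2}
```

Reads hit `flat_index` directly; writes only ever land via `flush`, so the index is opened for writing once per batch instead of once per request.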

With the IDs in hand (from whichever search type you perform), you can then go to the DB to get the surrounding required information. Then cache your output to further minimize what is already a fast search, DB query, and data-building step... faster still if it just comes from cache.

Use something like Velocity, MemCached, or MemCached Win32 for your centralized caching across a web farm.

Andrew Siemer
Hmmm. My thought was that it'd be smarter to separate search (which can act the same for all users) from finding relationship info (which is always relative to me). So you'd execute a search, get back the list of IDs in that search, and then attach relationship distance. Was that what you're suggesting too, or something else? Also, although memcached (or velocity, or...) would work for per-user caches (e.g. cache 2nd-level connections after login), if I wanted to cache the entire network, I'd need to fetch many (e.g. 1000) records in a batch with 1 remote call. Is memcached good at that?
Justin Grant
+4  A: 

Interestingly, 1970's technology would do a fair job of modeling this. The Network Database Model efficiently manages this type of relationship.

It's not efficient in terms of ad hoc queries or data model maintenance, so it fell out of favor with the rise of relational data models.

Eric J.
+2  A: 

You may be able to leverage axioms about small world networks to optimize this type of traversal.

Small world networks are characterized by "hubs" that represent very dense interconnections of other nodes. Most nodes in the network will generally either connect within a few hops to a topologically nearby node (1-4 hops away) or will route through one or more such hubs. This is one of the main reasons that small world networks behave the way they do.
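As an illustration (not how LinkedIn does it), a hub-aware lookup could first check whether two users share a high-degree hub before falling back to a full breadth-first search. Everything below, including the degree threshold, is an assumption made for the sketch:

```python
# Toy graph: adjacency sets keyed by node ID (illustrative data).
GRAPH = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub", "d"}, "d": {"hub", "c"},
    "x": {"y"}, "y": {"x"},
}

HUB_DEGREE = 3  # assumed threshold: nodes with >= this many links count as hubs
HUBS = {n for n, adj in GRAPH.items() if len(adj) >= HUB_DEGREE}

def short_path_via_hub(u, v):
    """Return a length-2 path u -> hub -> v if both touch a common hub,
    else None (caller would then fall back to ordinary BFS)."""
    shared = GRAPH[u] & GRAPH[v] & HUBS
    if shared:
        return [u, next(iter(shared)), v]
    return None
```

As the comments below note, this finds *a* short path rather than provably the shortest one, but checking the small set of hubs first prunes most of the fan-out that a naive neighbor expansion would do.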

LBushkin
Yep, LinkedIn (and any social networking site) definitely has networks which act like this. How do you think I could apply these effects to create a solution that's better (easier to build and/or faster to execute) than the simplistic "persistently cache everyone's first-level connections, and cache second-level connections on login" solution I came up with above?
Justin Grant
Conceptually, you're trying to find the shortest path between two nodes in a small world network (SWN). A naive algorithm starts at one of the target nodes and expands outwards looking at children, grandchildren, great-grandchildren, until you find the other target. Computationally it's O(c^n) where c = average number of children for a node. Using the properties of SWN, first identify which nodes are hubs, identify the number of hops between all hubs n*(n-1), order the hubs by size, and then first look for both target nodes as children of this set of hubs.
LBushkin
If none of the hubs contains a child, you would expand outward looking at the immediate children of a hub for either target node. Such an algorithm may not find the *shortest* distance between two nodes, but it will find a relatively short one if it exists through a hub. Furthermore, you can build some caching structures around the hubs in such a network to optimize lookups for immediate children and grandchildren.
LBushkin
Very interesting, and nice computational details. In LinkedIn's case, they show all matching paths of a given length, not just the first one encountered at a particular depth, so I think your traversal algorithm won't help to reduce the total time/CPU needed. But I like your idea of using "hubness" (or even using # of connections as a proxy) as a way to prioritize what to cache, instead of simply using LRU caching. For this useful insight, I'll accept your answer!
Justin Grant
A: 

I'm not sure of the table structure, or complexity of the system, but here is a simple SQL Server example using a recursive CTE:

DECLARE @People table (PersonID int, Name varchar(10))
DECLARE @Network table (PersonID int, NetworkedPersonID int)
INSERT INTO @People VALUES (1,'AAA')
INSERT INTO @People VALUES (2,'BBB')
INSERT INTO @People VALUES (3,'CCC')
INSERT INTO @People VALUES (4,'DDD')
INSERT INTO @People VALUES (5,'EEE')
INSERT INTO @People VALUES (6,'FFF')
INSERT INTO @People VALUES (7,'GGG')
INSERT INTO @People VALUES (8,'HHH')
INSERT INTO @Network VALUES (1,2)
INSERT INTO @Network VALUES (1,3)
INSERT INTO @Network VALUES (2,5)
INSERT INTO @Network VALUES (2,7)
INSERT INTO @Network VALUES (4,8)
INSERT INTO @Network VALUES (7,8)
INSERT INTO @Network VALUES (7,3)
INSERT INTO @Network VALUES (8,9)
DECLARE @TargetPersonID  int
SET @TargetPersonID=1

;WITH NetworkLevels AS
(   SELECT
        NetworkedPersonID,1 AS NetworkLevel
        FROM @Network
        WHERE PersonID=@TargetPersonID
    UNION ALL
    SELECT
        n.NetworkedPersonID, l.NetworkLevel+1
        FROM @Network                n
            INNER JOIN NetworkLevels l ON n.PersonID=l.NetworkedPersonID
    WHERE l.NetworkLevel<=2
)
SELECT * FROM NetworkLevels

OUTPUT:

NetworkedPersonID NetworkLevel
----------------- ------------
2                 1
3                 1
5                 2
7                 2
8                 3
3                 3

(6 row(s) affected)
KM
This solution will almost certainly be I/O bound at real-world scale (e.g. 50M members, 100 connections per user, 50+ queries/second). If you only go 2 levels deep, that's 10K I/Os per second; if you go 3 levels deep, that's 1M I/Os per second. Relational databases (especially those backed by non-SSD, disk-based storage) aren't well suited to handle I/O rates that high. A custom solution which cached the connections in RAM could probably get 100x the throughput and use half the RAM, which (at the scale of a service like LinkedIn) would probably save $1M+ in hardware.
Justin Grant
@Justin Grant, gee, you don't think my 15-line query would replace the custom solution developed (with considerable effort over several years, and with some serious cash, no doubt) by LinkedIn? I don't either; it was just a boring day ;-) I thought this post might help someone doing a much lower-volume application of the 3-level relationship. Perhaps in a week, six months, or a year, someone will google this page and this simple code might help them. Not every web app has a completely custom data storage system, the available resources, or the high volume of LinkedIn.
KM
agreed. my fault for not being clear enough that the essence of the question was not solving the problem (which your solution certainly does!) but solving the problem while dealing with the perf/scale/cost tradeoffs that accompany very large (10M-100M+) social networks.
Justin Grant