Hello StackOverflow, my new friends!
I'm facing a challenge in my social networking app. The goal is to calculate the popularity of a user's "peers" - that is, the user's friends' friends.
I do this today without a problem, but it's one of the more time/processor-intensive operations, and I'm embarking on a new phase of business development that will likely spike this type of activity, so I really need to optimize the process.
Simplifying to the essentials, here's how it's done today.
For user X, I clear out a "summary" table which has friend_id and connection_count:

    summary
    -------
    friend_id
    connection_count

NOTE: there's no index on the friend_id field of this table right now. I was worried about the reindexing overhead on inserts. Could be fertile territory for exploration.
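For concreteness, here's a rough sketch of that table's DDL (the column types are my assumption; the real table may differ):

    -- Sketch of the summary table as described; INT types are assumed.
    -- Note: the ON DUPLICATE KEY UPDATE collapse query further down can
    -- only fire against a UNIQUE or PRIMARY key on friend_id.
    CREATE TABLE summary (
        friend_id        INT UNSIGNED NOT NULL,
        connection_count INT UNSIGNED NOT NULL DEFAULT 0
    );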
For some users, this table can quickly get to be millions of rows... probably under 10 million though.
I could just loop through the user's friends, get /their/ friends, and add/increment rows in this summary table as necessary, but I have a feeling that would be very slow (one round-trip per increment, index maintenance, etc.).
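In other words, something like this for every single friend-of-friend row (a sketch; it assumes a unique key on summary.friend_id so the upsert can fire):

    -- Naive approach: one upsert round-trip per friend-of-friend.
    -- Assumes a UNIQUE or PRIMARY key on summary.friend_id.
    INSERT INTO summary (friend_id, connection_count)
    VALUES (?, 1)
    ON DUPLICATE KEY UPDATE connection_count = connection_count + 1;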
Instead, for each of user X's friends, I populate a temporary 'connections' table, nearly identical to the summary table:
    connections
    -----------
    friend_id
    connection_count

I DO have an index on the friend_id field here. Maybe I don't need one on this table and instead need one on the summary table... dunno.
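Roughly like this (again, the column types are my assumption, and "temporary" may or may not mean an actual MySQL TEMPORARY table):

    -- Sketch of the per-friend working table; types are assumed.
    CREATE TEMPORARY TABLE connections (
        friend_id        INT UNSIGNED NOT NULL,
        connection_count INT UNSIGNED NOT NULL DEFAULT 1,
        INDEX idx_friend_id (friend_id)
    );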
I then collapse the connections table into the summary table before I move on to the next friend. Note that connection_count is never more than 1 if I collapse after each friend and then truncate before starting the next (one user is never friends with another more than once).
The "collapse" query is where I think I may be running into trouble. Looks something like this:
    INSERT INTO summary (friend_id, connection_count)
    SELECT connections.friend_id, 1
    FROM connections
    ON DUPLICATE KEY UPDATE
        connection_count = connection_count + 1;
As I said, connection_count in the 'connections' table is never more than 1 when I stage one friend at a time... this is why I can insert 1 and add 1 in the collapse query above, instead of selecting a value out of 'connections'.
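So the full per-friend cycle looks roughly like this, where friendships(user_id, friend_id) is a hypothetical name for wherever the friend lists actually live:

    -- 1. Load friend F's friends into the working table.
    INSERT INTO connections (friend_id, connection_count)
    SELECT friend_id, 1
    FROM friendships
    WHERE user_id = @F;

    -- 2. Collapse into the summary (the query above).
    INSERT INTO summary (friend_id, connection_count)
    SELECT friend_id, 1
    FROM connections
    ON DUPLICATE KEY UPDATE connection_count = connection_count + 1;

    -- 3. Reset before moving on to the next friend.
    TRUNCATE TABLE connections;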
So... if I don't go one friend at a time, my collapse query gets more efficient simply by being called less often. However, while building the 'connections' table, I'd then need to do another INSERT ... ON DUPLICATE KEY UPDATE, like the one in my collapse query, and I have a feeling that's not ideal.
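Something like this, where the upsert moves into the populate step and the collapse carries the accumulated counts. (Same hypothetical friendships table as above; 'connections' would now need a UNIQUE key on friend_id for its upsert to fire, and VALUES() refers to the value the row would have inserted.)

    -- Batched variant (sketch): stage ALL of user X's friends-of-friends
    -- at once, counting duplicates while populating.
    INSERT INTO connections (friend_id, connection_count)
    SELECT f2.friend_id, 1
    FROM friendships f1
    JOIN friendships f2 ON f2.user_id = f1.friend_id
    WHERE f1.user_id = @X
    ON DUPLICATE KEY UPDATE connection_count = connection_count + 1;

    -- Collapse once, adding the accumulated counts instead of a constant 1.
    INSERT INTO summary (friend_id, connection_count)
    SELECT friend_id, connection_count
    FROM connections
    ON DUPLICATE KEY UPDATE
        connection_count = connection_count + VALUES(connection_count);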
I don't think it'd be good to forgo the intermediate step of the 'connections' table, but I'm not a database expert, so I could be wrong there.
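For what it's worth, forgoing the intermediate table entirely would come down to a single aggregate like the one below (again using the hypothetical friendships table); what I can't judge is how it behaves when the result runs to millions of rows:

    -- No intermediate table: aggregate friends-of-friends directly.
    INSERT INTO summary (friend_id, connection_count)
    SELECT f2.friend_id, COUNT(*)
    FROM friendships f1
    JOIN friendships f2 ON f2.user_id = f1.friend_id
    WHERE f1.user_id = @X
    GROUP BY f2.friend_id;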
And then, maybe (probably) there's an entirely different approach which would be far more efficient than this one.
Any ideas?? I've thought about trying this purely in code but... ugh.