views:

27

answers:

1

Given a large scale-free graph (a social network graph), what's the best way to sample it such that the sample retains an acceptable abstraction of the properties of the original?

I have a large graph (Munmun's twitter dataset, if you know it). But I need a connected sample of that graph with a reasonably large diameter (tl;dr... reasons why on request... a diameter of 10 would be good).

The problem is that any kinda breadth-first search always is likely to come across some massively connected nodes. So I start such a search, getting the friends of all nodes which I come across. I inevitably come across some massively-connected nodes, and have to get all their friends. This is a problem because I end up with a large number of nodes which are close to each other in the graph. To make programmatic analysis feasible, I have to limit the number of nodes (and edges). The whole point of this exercise is to find shortest paths between nodes, so I'm generally interested in ALL of a node's neighbours. And that's the problem.

One hack around this is to limit the max. number of nodes connected to a user which I'm interested in. For instance, if I come across @barackobama in my breadth-first search, I ensure that I only accept some small proportion of his friends and ignore the rest. But would this hacked graph be worth a damn, or am I losing too much information in terms of finding shortest paths??

Hope that makes sense...

A: 

I am not sure, if I understand your question correctly. I think the main question you have is, about how you can compute the shortest path of two nodes in a giant, directed graph. Creating a subsample of the graph seems to be your attempt to create an efficient solution. (But I probably misunderstood you completely.)

Perhaps this SO-Question has some pointers for you: Efficiently finding the shortest path in large graphs

The graphs in that question seem to be significantly smaller, though.

bjoernz
Thanks... the info on that page is useful...
John Conroy