views:

489

answers:

4

Abstract problem : I have a graph of about 250,000 nodes and the average connectivity is around 10. Finding a node's connections is a long process (10 seconds lets say). Saving a node to the database also takes about 10 seconds. I can check if a node is already present in the db very quickly. Allowing concurrency, but not having more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest.

Concrete problem : I'm trying to scrape a website user pages. To discover new users I'm fetching the friend list from already known users. I've already imported about 10% of the graph but I keep getting stuck in cycles or using too much memory remembering too many nodes.

My current implementation :

def run() :
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool) :
    id = user
    alias = models.Alias.get(id)

    # if its been updates in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now() :
        sys.stderr.write("Skipping: %s\n" % user)
    else :
        sys.stderr.write("Importing: %s\n" % user)
        while import_pool.num_jobs() > 20 :
            print "Too many queued jobs, sleeping"
            time.sleep(15)

        import_pool.add_job(alias_view.import_id, [id], lambda rv : sys.stderr.write("Done Importing %s\n" % user))

    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2 :
        for user in random.sample(users, 2) :
            if (user_pool.num_jobs() < 100) :
                user_pool.add_job(do_user, [user, import_pool, user_pool])

def crawl(id, limit=50) :
    '''returns the first 'limit' friends of a user'''
    *not relevant*

Problems of current implementation :

  • Gets stuck in cliques that I've already imported, thereby wasting time and the importing threads are idle.
  • Will add more as they get pointed out.

So, marginal improvments are welcome, as well as full rewrites. Thanks!

+2  A: 

I am really confused as to why it takes 10 seconds to add a node to the DB. That sounds like a problem. What database are you using? Do you have severe platform restrictions?

With modern systems, and their oodles of memory, I would recommend a nice simple cache of some kind. You should be able to create a very quick cache of user information that would allow you to avoid repeated work. When you have encountered a node already, stop processing. This will avoid cycling forever in cliques.

If you need to allow for rehashing existing nodes after a while, you can use a last_visit_number which would be a global value in the dB. If the node has that number, then this crawl is the one that encountered it. If you want to automatically revisit any nodes, you just need to bump the last_visit_number before starting the crawl.

By your description, I am not quite sure how you are getting stuck.

Edit ------ I just noticed you had a concrete question. In order to increase how quickly you pull in new data, I would keep track of the number of times a given user was linked to in your data (imported or not yet imported). When choosing a user to crawl, I would pick users that have a low number of links. I would specifically go for either the lowest number of links or a random choice among the users with the lowest number of links.

Jacob

TheJacobTaylor
The 10 seconds comes from having to scrape a few pages of information on the user and then transform it to my database format. Most of it is network time.
Paul Tarjan
As for the choice of new users, very interesting. I'll try out counting the inlinks for users and only spidering from low inlinked users.
Paul Tarjan
Why so few threads? Are you worried that they will block you? I was going to suggest a hash for each node (a.la Pavel). One thing you could do is create an incrementing id and use a simple mapping table to cross reference them.
TheJacobTaylor
Yes, I'm worried the remote site will block me. Politeness and all that (at Yahoo! I know our crawlers aim for 2 requets / host at any time). With the in-memory-set working, I don't need to optimize my crawl path, but if the other site grows in order of magnitude I'll have to move to your strategy. Thanks!
Paul Tarjan
One, mildly twisted, option would be to leverage a search engine's results through their API to get the data that you want. If you are building your initial database, the cached page, if available, would probably give you the data you need. The search engines run at a huge velocity and would probably not mind a few more threads.
TheJacobTaylor
Let me know how the in-link prioritization works. :)
TheJacobTaylor
+6  A: 

To remember IDs of the users you've already visited, you need a map of a length of 250,000 integers. That's far from "too much". Just maintain such a map and only traverse through the edges that lead to the already undiscovered users, adding them to that map at the point of finding such edge.

As far I can see, you're close to implement Breadth-first search (BFS). Check google about the details of this algorithm. And, of course, do not forget about mutexes -- you'll need them.

Pavel Shved
the users are actually character strings of average length 15. I tried having a dict with {username1 : True, username2 : True} but that quickly hit 100% memeory and the machine locked up. Maybe it is just inefficient in python to use a dict?
Paul Tarjan
one possible solution would be just to store hashes of the usernames
cobbal
also, a set is better suited to that type of storage than a dict
cobbal
Are sets O(1) lookups?
Paul Tarjan
In the docs (http://docs.python.org/library/stdtypes.html#set-types-set-frozenset) it says, "Set elements, like dictionary keys, must be hashable" so I assume so
cobbal
and the lookup is good, too :-)http://blog.tplus1.com/index.php/2008/03/17/pythons-hash-lookup-is-insanely-good/
Francesco
Sets use O(1) lookup, of course. http://wiki.python.org/moin/TimeComplexity
kaizer.se
Yup, making the memory model into a set() and just keeping the whole history in memory did it. I traversed 23k last night and I'm only using 800MB of memory. It still will be close but I'll move to hashes instead of character strings in a few days. Thanks!
Paul Tarjan
+2  A: 

There is no particular algorithm that will help you optimise the construction of a graph from scratch. One way or another, you are going to have to visit each node at least once. Whether you do this depth first or breadth first is irrelevant from a speed perspective. Theran correctly points out in a comment below that breadth-first search, by exploring nearer nodes first, may give you a more useful graph immediately, before the whole graph is completed; this may or may not be a concern for you. He also notes that the neatest version of depth-first search is implemented using recursion, which could potentially be a problem for you. Note that recursion is not required, however; you can add incompletely explored nodes to a stack and process them linearly if you wish.

If you do a simple existence check for new nodes (O(1) if you use a hash for lookup), then cycles will not be a problem at all. Cycles are only a concern if you do not store the complete graph. You can optimise searches through the graph, but the construction step itself will always take linear time.

I agree with other posters that the size of your graph should not be a problem. 250,000 is not very large!

Regarding concurrent execution; the graph is updated by all threads, so it needs to be a synchronised data structure. Since this is Python, you can make use of the Queue module to store new links still to be processed by your threads.

ire_and_curses
BFS might be better because it will look at the nearest nodes to the initial one first, which is likely to give a useful subset early on. BFS also avoids the risk of recursion 250,000 levels deep and could keep its queue in the same DB as the final graph (assuming a RDBMS).
Theran
You can certainly make DFS without the problem of the deep stack trace: the only real difference between DFS and BFS is in BFS you add nodes to a queue; in DFS, a stack. Same algorithm, different data structure—and thus, different semantics.
Michael Deardeuff
@Theran, Michael: +1 Thanks - answer adjusted to make this clarification.
ire_and_curses
Sounds like a simple BFS (like I have) is in order. I am100% with you with the 250,000 being small, but my original implementation quickly ran out of memeory doing a dictionary of {username1 : True, username2 : True}. Maybe I should post that as another issue.
Paul Tarjan
A: 

Although you say that getting a friend list takes a lot of time (10 seconds or more), a variant of good-old Dijkstra's algorithm just might work:

  1. Get any node.
  2. Get a connection from any node you already loaded.
  3. If the other end hasn't been loaded yet, add the node to the graph.
  4. Go to step 2.

The trick is to select the connection you load in step 2 in a smart way. A few short remarks about this:

  • You should somehow prevent the same connection to be loaded twice or more often. Selecting a random connection and discard it if it's been loaded already is very inefficient if you're after all connections.
  • If you want to load all connections eventually, load all connections of a node at the same time.

In order to really say something about efficiency, please provide more details about datastructure.

Martijn