tags:

views:

147

answers:

3

building a depth-first web spider, meaning it will visit all links on first page, and go to each link, and visit links on all second page...

should you use recursion ? i find this to be cpu intensive.

def recursion()

linkz_on_first_page.each do |link|

recursion(link)

end
end
recursion(firstpage)
+12  A: 

Definitely not, you're going to run into problems very quickly because of the actual nature of the world wide web. The second you hit a site with a main navigation section, where each page links to each other page, you've entered an infinite loop.

You could keep track of which links you have handled, but even then, a recursive loop doesn't really fit the nature of the world wide web (although at first thought it seems to, the web is more of an actual "web" than a tree). You're better off finding all links on the current page and adding those links (if they don't already exist) to a central queue, and proceeding iteratively through the queue processing every link as you come to it (remember to keep track of links that you've finished processing, or you'll add them to the end of the queue again)

LorenVS
My immediate thought was also "queue". With the possible addition of a little de-duplicating intelligence, I expect that it would also provide the basis for a very robust scaling capability.
Mike Woodhouse
how do i create the queue ? do you mean create a large array containing all links, and before navigating to each link, check that this has been not visited before ?
pgh
In theory an array would be enough, in practice you're going to have performance/scalibility issues as your central queue is going to be the bottleneck. You might get into hashing or something to reduce lookup time.
sebastiangeiger
To be perfectly honest I've never coded a line of ruby in my life, so I can't tell you what sort of data structures ruby has built into it, although, being a more dynamic language it probably has really good support for dictionaries/hash-tablesMy first instinct would be to have a single array or arraylist acting as the bucket for links that still need to be processed, and then either a sorted list (sorted on the link) or a hash table (with the link as the key) to keep track of which links you have already visited...
LorenVS
+4  A: 

Recursion seems to be appropriate - until you think a bit more about it. In case you've got a page A containing a link to page B and page B containing a link to page A you're stuck in an endless cycle.

Recursion is not any more CPU intensive than doing it in a "normal" way. You have to ask yourself whether you need all the links or if it's suffice to only grab links down to a certain level. In the latter case this also solves your endless cycle problem.

If you need all links I would rather use a list of links in which each link is unique. Your program takes a link from the list and spiders this link. Once a new link is discovered you try to insert it into this list and so on.

sebastiangeiger
damn... too slow.
sebastiangeiger
+1  A: 

It's not so much that recursion is CPU intensive (it's not really), but more that you'll blow up your call stack after a few thousand recursive calls - which you would easily hit writing a web spider running on the open internet.

Example:

def blow_stack(level=0)
  puts "at level #{level}"
  blow_stack(level+1)
end

Output of this on my Macbook pro:

irb(main):009:0> blow_stack
at level 0
at level 1
... (skip a bunch of output)
at level 6295
at level 6296
SystemStackError: stack level too deep
        from (irb):7:in `blow_stack'
        from (irb):7:in `blow_stack'
        from (irb):9
        from :0
madlep
what if i call stacks, and they are appropriately closed.
pgh
Not sure if I follow that question pgh.When I say "call stack" I'm referring to the function call stack that most languages (including Ruby) use to track which functions have been called, and where to return to. This stack is of finite size (usually room for a few thousand method call layers), so if you're doing deep recursion, you'll run out of stack space.This is different to a regular "stack" which is accessed by a regular non-recursive function. That would work fine.
madlep