I am trying to store the links that I scrape from a site in a non-binary tree. The links are laid out hierarchically (obviously). The question is: how do I generate the tree? That is, how do I work my way through the pages the links point to so that I know which page is whose child?

For now I can get the first and second levels of links, but I have no idea where to go from here, other than that I have to build the tree recursively and stop when I reach a leaf (I already have a way to detect that).

What I was thinking was something like (code in Python):

def buildTree(root):
    for node in root.children:
        if <end condition here>:
            continue
        else:
            nodes = getNodes(urllib2.urlopen(node.url).read())
            node.addChildren(nodes)
            buildTree(node)

where root and each element of nodes are instances of a user-defined Node class.
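
For readers following along, here is a minimal sketch of what such a Node class might look like. The attribute names `url` and `children` and the method `addChildren` are taken from the snippet above; everything else, including the example URLs, is an assumption made for illustration.

```python
class Node:
    """A tree node holding a URL and an ordered list of child nodes."""

    def __init__(self, url):
        self.url = url
        self.children = []

    def addChildren(self, nodes):
        # Accept any iterable of Node objects and append them in order.
        self.children.extend(nodes)


# Hypothetical example URLs, used only to show the shape of the tree.
root = Node("http://example.com/")
root.addChildren([Node("http://example.com/a"), Node("http://example.com/b")])
```

Note that nothing here prevents the same URL from appearing in several nodes, so a recursive build over real links can loop forever if two pages link to each other.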

+1  A: 

Obviously, the links in a site form a graph, not a tree. You should have a Page object, which is identified by its URL, and a Link object, which points from one page to another. Page A can link to page B while page B links back to page A, which is what makes the structure a graph rather than a tree.
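
One possible way to model the Page and Link objects described above is sketched below. The field names (`url`, `links`, `source`, `target`) and the helper `link_to` are assumptions chosen for this example, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A directed edge from one page URL to another."""
    source: str
    target: str


@dataclass
class Page:
    """A graph vertex identified by its URL."""
    url: str
    links: list = field(default_factory=list)  # outgoing Link objects

    def link_to(self, other):
        # Record a directed link from this page to `other`.
        link = Link(self.url, other.url)
        self.links.append(link)
        return link


# Hypothetical pages that link to each other.
a = Page("http://example.com/a")
b = Page("http://example.com/b")
a.link_to(b)
b.link_to(a)  # a cycle: legal in a graph, impossible in a tree
```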

Scanning algorithm pseudo-code:

process_page(current_page):
    for each link on current_page:
        target_page = the page the link points to
        if target_page is not already in your graph:
            create a Page object to represent target_page
            add it to the to_be_scanned set
        add a link from current_page to target_page

scan_website(start_page):
    create Page object for start_page
    to_be_scanned = set containing start_page
    while to_be_scanned is not empty:
        current_page = to_be_scanned.pop()
        process_page(current_page)
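
The scanning pseudo-code above could be sketched in Python 3 roughly as follows. This is a simplified version under stated assumptions: the graph is a plain dict from URL to the set of URLs it links to, and `fetch_links` is a hypothetical callable that returns the link targets found on a page. Here it is backed by an in-memory dict so the example runs without network access; in practice it would download and parse the page.

```python
def scan_website(start_url, fetch_links):
    """Return a dict mapping each discovered URL to the set of URLs it links to."""
    graph = {start_url: set()}
    to_be_scanned = {start_url}
    while to_be_scanned:
        current = to_be_scanned.pop()
        for target in fetch_links(current):
            if target not in graph:
                # First time we see this page: create it and queue it.
                graph[target] = set()
                to_be_scanned.add(target)
            # Always record the link, even if the target was already known.
            graph[current].add(target)
    return graph


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b"],
    "/b": [],
}

graph = scan_website("/", lambda url: FAKE_SITE.get(url, []))
```

Because a page is queued only the first time it is discovered, each page is scanned once and the loop terminates even when pages link back to each other.
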
Ofri Raviv
Yes, it's totally a graph and not a tree. Thanks!
hyperboreean