views: 806

answers: 5
For academic and performance's sake: given this recursive crawl() web-crawling function (which crawls only within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes, Python has climbed to over 1GB of memory usage, which isn't acceptable for running in a shared environment.

def crawl(self, url):
    "Get all URLs from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    # recurse into every new, non-category URL
                    self.crawl(attr[1])
+4  A: 

Use BFS instead of crawling recursively (DFS): http://en.wikipedia.org/wiki/Breadth_first_search

You can use an external storage solution (such as a database) for the BFS queue to free up RAM.

The algorithm is:

// pseudocode:
var urlsToVisit = new StorageObject(); // Queue (BFS) or stack (DFS), probably with a database backing.
var visitedUrls = new List();          // List of visited URLs.

// initialization:
urlsToVisit.Add( rootUrl );

while(urlsToVisit.Count > 0) {
  var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
  var page = FetchPage(nextUrl);
  ProcessPage(page);
  visitedUrls.Add(nextUrl);
  var links = ParseLinks(page);
  foreach (var link in links)
     if (!visitedUrls.Contains(link))
        urlsToVisit.Add(link); 
}
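
A rough Python translation of the pseudocode above, using collections.deque as the in-memory queue (the function name and the match_tag predicate are placeholders here; a database-backed store could be swapped in for the deque as described):

import urllib2
from collections import deque
from BeautifulSoup import BeautifulSoup

def crawl_iteratively(root_url, match_tag):
    urls_to_visit = deque([root_url])   # popleft() gives BFS; pop() would give DFS
    visited = set()                     # a set keeps membership tests cheap

    while urls_to_visit:
        next_url = urls_to_visit.popleft()
        if next_url in visited:
            continue
        visited.add(next_url)
        try:
            page = urllib2.urlopen(next_url).read()
        except urllib2.HTTPError:
            continue
        for link in BeautifulSoup(page).findAll(match_tag):
            href = dict(link.attrs).get('href')
            if href and href not in visited:
                urls_to_visit.append(href)
    return visited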
Mehrdad Afshari
+6  A: 

Instead of recursing, you could put the new URLs to crawl into a queue, then run until the queue is empty, without recursing. If you keep the queue in a file, it uses almost no memory at all.
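
A minimal sketch of such a file-backed queue (the class name and layout are illustrative, not part of this answer): URLs are appended to a file and read back in order, so the pending queue never has to live in RAM.

class FileQueue(object):
    "FIFO of URLs kept on disk."

    def __init__(self, path):
        self.path = path
        self.read_pos = 0
        open(self.path, 'w').close()  # start with an empty queue file

    def put(self, url):
        f = open(self.path, 'a')
        f.write(url + '\n')
        f.close()

    def get(self):
        f = open(self.path, 'r')
        f.seek(self.read_pos)
        line = f.readline()
        self.read_pos = f.tell()      # remember how far we have read
        f.close()
        return line.strip() or None   # None once the queue is drained

The crawl loop then just keeps calling get() until it returns None, and put()s any newly discovered URLs.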

Ber
Or a stack -- pushing onto a stack gives you depth-first search.
S.Lott
A: 

You can do it pretty easily just by using links as a queue:

def get_links(url):
    "Extract all matching links from a url"
    try:
        return BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return []

def crawl(self, url):
    "Get all URLs from which to scrape categories."
    links = get_links(url)
    while len(links) > 0:
        link = links.pop()
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    # prepend the new links to the queue
                    links = get_links(attr[1]) + links

Of course, this doesn't solve the memory problem...

tghw
+2  A: 

@Mehrdad - Thank you for your reply; the example you provided was concise and easy to understand.

The solution:

def crawl(self, url):
    urls = Queue(-1)   # requires: from Queue import Queue
    _crawled = []

    urls.put(url)

    while not urls.empty():
        url = urls.get()
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr):
                    if Crawler._is_category(attr):
                        continue
                    else:
                        Crawler._visit(attr[1])
                        if attr[1] not in _crawled:
                            _crawled.append(attr[1])  # mark it so it is only queued once
                            urls.put(attr[1])
samuraisam
A: 

Hi all, in the sample given below I get all the anchor tags of a web page, and by crawling I also fetch the child nodes of a given root node (anchor tag) down to the nth level. Now I need to create a tree structure for this, so that I can differentiate between the root node and its child nodes. Can anyone suggest something?

import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO
import re
from BeautifulSoup import BeautifulSoup

def crawl_page(html):
    "Parse the HTML, save its anchors to a file, and return them."
    writer = formatter.DumbWriter(StringIO())
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()

    # write the anchor list out once, rather than once per anchor
    fo = open("c://Keywords.txt", "wb")
    fo.writelines(parser.anchorlist)
    fo.close()

    return parser.anchorlist

class Spider:

    def process_page(self, startURL):
        filehandle = open(startURL, "r")
        data = filehandle.read()
        filehandle.close()
        for link in crawl_page(data):
            if len(link) == 0:
                return
            else:
                self.process_page(link)

if __name__ == '__main__':
    startURL = "d://PythonSample/Page.html"
    spider = Spider()
    spider.process_page(startURL)
    print startURL
Neha
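
One way to record the parent/child relationships asked about above, sketched on top of the crawl_page function from that post (the build_tree name and dict layout are illustrative, not from the post itself), is a dict mapping each page to the links found inside it:

def build_tree(url, tree=None):
    "Return a dict mapping each page to the list of links found inside it."
    if tree is None:
        tree = {}
    if url in tree:              # already expanded; also avoids infinite recursion
        return tree
    tree[url] = []
    filehandle = open(url, "r")
    data = filehandle.read()
    filehandle.close()
    for link in crawl_page(data):
        if len(link) == 0:
            continue
        tree[url].append(link)   # link is a child of url
        build_tree(link, tree)   # recurse so grandchildren are recorded too
    return tree

The root is simply the URL passed in first, e.g. tree = build_tree("d://PythonSample/Page.html"); every other key is a child (or deeper descendant) reachable from it.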