tags:

views:

1303

answers:

3
+2  Q: 

Tor with Python?

I'm trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python meaning I want to crawl the site anonymously using Tor.

I have no idea how to do that. Any help on how to do that will be very helpful.

I tried doing this. It doesn't seem to work. I checked my ip it is still the same as the one before i used tor. i checked it via python.

import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
+1  A: 

Perhaps you're having some network connectivity issues? The above script worked for me (I substituted a different URL - I used http://stackoverflow.com/ - and I get the page as expected:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
 <html> <head>

<title>Stack Overflow</title>        
<link rel="stylesheet" href="/content/all.css?v=3856">

(etc.)

Vinay Sajip
A: 

Using privoxy as http-proxy in front of tor works for me - here's a crawler-template:


import urllib2
import httplib

from BeautifulSoup import BeautifulSoup
from time import sleep

class Scraper(object):
    def __init__(self, options, args):
        if options.proxy is None:
            options.proxy = "http://localhost:8118/"
        self._open = self._get_opener(options.proxy)

    def _get_opener(self, proxy):
        proxy_handler = urllib2.ProxyHandler({'http': proxy})
        opener = urllib2.build_opener(proxy_handler)
        return opener.open

    def get_soup(self, url):
        soup = None
        while soup is None:
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', 'foo bar useragent')
                soup = BeautifulSoup(self._open(request))
            except (httplib.IncompleteRead, httplib.BadStatusLine,
                    urllib2.HTTPError, ValueError, urllib2.URLError), err:
                sleep(1)
        return soup

class PageType(Scraper):
    _URL_TEMPL = "http://foobar.com/baz/%s"

    def items_from_page(self, url):
        nextpage = None
        soup = self.get_soup(url)

        items = []
        for item in soup.findAll("foo"):
            items.append(item["bar"])
            nexpage = item["href"]

        return nextpage, items

    def get_items(self):
        nextpage, items = self._categories_from_page(self._START_URL % "start.html")
        while nextpage is not None:
            nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
            items.extend(newitems)
        return items()

pt = PageType()
print pt.get_items()
+8  A: 

You are trying to connect to the control port, try connecting to the actual TOR port : 8118.

Example:

    proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support) 
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    print opener.open('http://www.google.com').read()

Also please note properties passed to ProxyHandler, no http prefixing the ip:port

Dmitri Farkov