I have the URL below:

http://bit.ly/cDdh1c

When you paste this URL into a browser and hit Enter, it redirects to http://www.kennystopproducts.info/Top/?hop=arnishad

However, when I try to find the base URL (after following all the redirects) for the same URL, http://bit.ly/cDdh1c, with a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.

Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it redirects to the proper URL? I am wondering how this strange behaviour can happen.

Another URL for which I observe similar behaviour:

  1. http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via browser)
  2. http://bit.ly/bEKyOx ====> http://www.ebay.com (via Python program)

    maxattempts = 5
    turl = url
    while maxattempts > 0:
        host, path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0:
            return None
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" % turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            self.logger.debug("The present url %s is a proper one" % turl)
            return turl
        else:
            # some problem with this url
            return None
    return None
    

Log file for your reference

2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/
+1  A: 

Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.
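
To see what gets dropped, here is a small sketch that splits the second URL from the log. It assumes Python 3, where the `urlparse` module lives at `urllib.parse` (the question itself uses Python 2), and the variable names are only illustrative.

```python
from urllib.parse import urlsplit

# The redirect target taken from the log in the question.
url = ("http://www.cbtrends.com/get-product.html"
       "?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D"
       "&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter")

parts = urlsplit(url)

# Index 1 is the host, index 2 the path, index 3 the query string.
host, path = parts[1:3]   # what the question's code keeps
query = parts[3]          # what it silently drops

print(host)
print(path)
print(query)
```

The slice `[1:3]` stops before the query, so only the host and the bare path reach the request.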

So, instead try:

import httplib
import urlparse

def getUrl(url):
    """Follow redirects from url and return the final URL, or None on failure."""
    maxattempts = 10
    turl = url
    while maxattempts > 0:
        host, path, query = urlparse.urlsplit(turl)[1:4]
        if len(host.strip()) == 0:
            return None
        # Send the query string too; append '?' only when there is one.
        target = path or '/'
        if query:
            target = target + '?' + query
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("GET", target)
            resp = connection.getresponse()
        except Exception:
            # Connection or protocol error
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            # Redirect: continue with the URL from the Location header
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            return turl
        else:
            # some problem with this url
            return None
    return None

print getUrl('http://bit.ly/cDdh1c')
Jack
Thanks for pointing that out. Now it is working fine.
Rama Vadakattu
+1  A: 

Your problem comes from this line:

host,path = urlparse.urlsplit(turl)[1:3]

You're leaving out the query string. So in the example log you provided, the second HEAD request goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in your browser and you'll see that it redirects to http://www.cbtrends.com/.

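You have to build the request path from both the path and the query elements of the tuple returned by urlsplit. The fragment (the part after #) is never sent to the server, so leave it out.

parts = urlparse.urlsplit(turl)
host = parts[1]
path = parts[2]
if parts[3]:
    path = "%s?%s" % (path, parts[3])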
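
As a rough sketch of the same idea for Python 3 (an assumption, since the question uses Python 2's `urlparse`), the helper below joins the path and the query into the string to pass to `request()`. The name `request_target` is hypothetical. It skips the fragment, which browsers do not send to the server, and falls back to "/" when the URL has no path.

```python
from urllib.parse import urlsplit

def request_target(url):
    """Return (host, target), where target is the path plus the query string."""
    parts = urlsplit(url)
    target = parts.path or "/"        # an empty path is requested as "/"
    if parts.query:
        target += "?" + parts.query   # keep the query string
    # parts.fragment is deliberately ignored: it is not sent to the server.
    return parts.netloc, target

print(request_target("http://bit.ly/cDdh1c"))
```

The returned host would go to the connection constructor and the target to the request call, in place of the bare path used in the question.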
Clément