I am trying to crawl a site using mechanize. The site splits its search results across several pages. When I POST to get the next set of results, something goes wrong: the server redirects me to the first page and asks mechanize to update the SearchSession cookie.

I have been debugging by comparing against the requests Firefox sends. The two look almost the same, and I am unable to find the problem. Any suggestions? Both requests are below:

----------- FIRST THE RIGHT SEQUENCE, USING TAMPER IN FIREFOX ------------------------- POST XXX/JobSearch/Results.aspx?Keywords=Python&LTxt=London%2c+South+East&Radius=0&LIds2=ZV&clid=1621&cltypeid=2&clName=London Load Flags[LOAD_DOCUMENT_URI LOAD_INITIAL_DOCUMENT_URI ] Content Size[-1] Mime Type[text/html] Request Headers: Host[www.cwjobs.co.uk] User-Agent[Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100401 Ubuntu/9.10 (karmic) Firefox/3.5.9] Accept[text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8] Accept-Language[en-us,en;q=0.5] Accept-Encoding[gzip,deflate] Accept-Charset[ISO-8859-1,utf-8;q=0.7,*;q=0.7] Keep-Alive[300] Connection[keep-alive] Referer[XXX/JobSearch/Results.aspx?Keywords=Python&LTxt=London%2c+South+East&Radius=0&LIds2=ZV&clid=1621&cltypeid=2&clName=London] Cookie[ecos=774803468-0; AnonymousUser=MemberId=acc079dd-66b6-4081-9b07-60d6955ee8bf&IsAnonymous=True; PJBIPPOPUP=; WT_FPC=id=86.181.183.106-2262469600.30073025:lv=1272812851736:ss=1272812789362; SearchSession=SessionGuid=71de63de-3bd0-4787-895d-b6b9e7c93801&LogSource=NAT] Post Data: __EVENTTARGET[srpPager%24btnForward] __EVENTARGUMENT[] hdnSearchResults[BV%2CA%2CC0P5x%2COou-%2CB4S-%2CBuC-%2CDzx-%2CHwn-%2CKPP-%2CIVA-%2CC9D-%2CH6X-%2CH7x-%2CJ0x-%2CCvX-%2CCra-%2COHa-%2CHhP-%2CCoj-%2CBlM-%2CE9W-%2CIm8-%2CBqG-%2CPFy-%2CN%2Fm-%2Ceaa%2CCvj-%2CCtJ-%2CCr7-%2CBpu-%2Cmh%2CMb6-%2CJ%2Fk-%2CHY8-%2COJ7-%2CNtF-%2CEya-%2CErT-%2CEo4-%2CEKU-%2CDnL-%2CC5M-%2CCyB-%2CBsD-%2CBrc-%2CBpU-%2Col%2C30%2CC1%2Cd4N%2COo8-%2COi0-%2CLz%2F-%2CLxP-%2CFyp-%2CFVR-%2CEHL-%2CPrP-%2CLmE-%2CK3H-%2CKXJ-%2CFyn%2CIcq-%2CIco-%2CIK4-%2CIIg-%2CH2k-%2CH0N-%2CHwp-%2CHvF-%2CFij-%2CFhl-%2CCwj-%2CCb5-%2CCQj-%2CCQh-%2CB%2B2-%2CBc6-%2ChFo%2CNLq-%2CNI%2F-%2CFzM-%2Cdu-%2CHg2-%2CBug-%2CBse-%2CB9Q-] 
__VIEWSTATE[%2FwEPDwUKLTkyMzI2ODA4Ng9kFgYCBA8WBB4EaHJlZgWJAWh0dHA6Ly93d3cuY3dqb2JzLmNvLnVrL0pvYlNlYXJjaC9SU1MuYXNweD9LZXl3b3Jkcz1QeXRob24mTFR4dD1Mb25kb24lMmMrU291dGgrRWFzdCZSYWRpdXM9MCZMSWRzMj1aViZjbGlkPTE2MjEmY2x0eXBlaWQ9MiZjbE5hbWU9TG9uZG9uHgV0aXRsZQUkTGF0ZXN0IFB5dGhvbiBqb2JzIGZyb20gQ1dKb2JzLmNvLnVrZAIGDxYCHgRUZXh0BV48bGluayByZWw9ImNhbm9uaWNhbCIgaHJlZj0iaHR0cDovL3d3dy5jd2pvYnMuY28udWsvSm9iU2Vla2luZy9QeXRob25fTG9uZG9uX2wxNjIxX3QyLmh0bWwiIC8%2BZAIIEGRkFg4CBw8WAh8CBV9Zb3VyIHNlYXJjaCBvbiA8Yj5LZXl3b3JkczogUHl0aG9uOyBMb2NhdGlvbjogTG9uZG9uLCBTb3V0aCBFYXN0OyA8L2I%2BIHJldHVybmVkIDxiPjg1PC9iPiBqb2JzLmQCCQ8WAh4HVmlzaWJsZWhkAgsPFgIfAgUoVGhlIG1vc3QgcmVsZXZhbnQgam9icyBhcmUgbGlzdGVkIGZpcnN0LmQCEw8PFgIeC05hdmlnYXRlVXJsBQF%2BZGQCFQ9kFgYCBQ8PFgYfAgUGUHl0aG9uHgtEZWZhdWx0VGV4dAUMZS5nLiBhbmFseXN0HhNEZWZhdWx0VGV4dENzc0NsYXNzZWRkAgsPDxYGHwIFEkxvbmRvbiwgU291dGggRWFzdB8FBQllLmcuIEJhdGgfBmVkZAIRDxAPFgYeDURhdGFUZXh0RmllbGQFClJhZGl1c05hbWUeDkRhdGFWYWx1ZUZpZWxkBQZSYWRpdXMeC18hRGF0YUJvdW5kZ2QQFREHMCBtaWxlcwcyIG1pbGVzBzUgbWlsZXMIMTAgbWlsZXMIMTUgbWlsZXMIMjAgbWlsZXMIMjUgbWlsZXMIMzAgbWlsZXMIMzUgbWlsZXMINDAgbWlsZXMINDUgbWlsZXMINTAgbWlsZXMINjAgbWlsZXMINzAgbWlsZXMIODAgbWlsZXMIOTAgbWlsZXMJMTAwIG1pbGVzFREBMAEyATUCMTACMTUCMjACMjUCMzACMzUCNDACNDUCNTACNjACNzACODACOTADMTAwFCsDEWdnZ2dnZ2dnZ2dnZ2dnZ2dnZGQCFw9kFgQCAQ9kFgQCBA8QZA8WA2YCAQICFgMQBQhBbGwgam9icwUBMGcQBRlEaXJlY3QgZW1wbG95ZXIgam9icyBvbmx5BQEyZxAFEEFnZW5jeSBqb2JzIG9ubHkFATFnZGQCBg8QZA8WA2YCAQICFgMQBQlSZWxldmFuY2UFATFnEAUERGF0ZQUBMmcQBQZTYWxhcnkFATNnZGQCBQ8PFgYeClBhZ2VOdW1iZXICAh4PTnVtYmVyT2ZSZXN1bHRzAlUeDlJlc3VsdHNQZXJQYWdlAhRkZAIZDxYCHwNoZGQ%3D] Refinesearch%24txtKeywords[Python] Refinesearch%24txtLocation[London%2C+South+East] Refinesearch%24ddlRadius[0] ddlCompanyType[0] ddlSort[1] Response Headers: Cache-Control[private] Date[Sun, 02 May 2010 16:09:27 GMT] Content-Type[text/html; charset=utf-8] Expires[Sat, 02 May 2009 16:09:27 GMT] Server[Microsoft-IIS/6.0] X-SiteConHost[P310] X-Powered-By[ASP.NET] X-AspNet-Version[2.0.50727] 
Set-Cookie[SearchSession=SessionGuid=71de63de-3bd0-4787-895d-b6b9e7c93801&LogSource=NAT; path=/] Content-Encoding[gzip] Vary[Accept-Encoding] Transfer-Encoding[chunked]

-------- NOW WHAT I'AM SENDING USING MECHANIZE, SOME HEADERS ADDED, ETC ----------- POST /JobSearch/Results.aspx?Keywords=Python&LTxt=London%2c+South+East&Radius=0&LIds2=ZV&clid=1621&cltypeid=2&clName=London HTTP/1.1\r\nContent-Length: 2424\r\n Accept-Language: en-us,en;q=0.5\r\n Accept-Encoding: gzip\r\n Host: www.cwjobs.co.uk\r\n Accept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8\r\n Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n Connection: keep-alive\r\n Cookie: AnonymousUser=MemberId=8fa5ddd7-17ed-425e-b189-82693bfbaa0c&IsAnonymous=True; SearchSession=SessionGuid=33e4e439-c2d6-423f-900f-574099310d5a&LogSource=NAT\r\n Referer: XXX/JobSearch/Results.aspx?Keywords=Python&LTxt=London%2c+South+East&Radius=0&LIds2=ZV&clid=1621&cltypeid=2&clName=London\r\n Content-Type: application/x-www-form-urlencoded\r\n\r\n' '__EVENTTARGET=srpPager%24btnForward& __EVENTARGUMENT=& hdnSearchResults=BV%2CA%2CC0eif%2CMwc%2CM6s%2COou%2CK09%2CG4H%2CEZf%2CGTu%2CLrr%2CGuX%2CGs9%2CEz9%2CL5X%2CL9U%2ChU%2CHHf%2CMAL%2CNDi%2CJrY%2CGBy%2CM%2Bo%2CdE-%2CpI%2CtDI%2CL5L%2CL7l%2CL8z%2CM%2FA%2CPPP%2CCM0%2CEpK%2CHPy%2Cez%2C7p%2CJ2U%2CJ9b%2CJ%2F2%2CKea%2CLBj%2CLvi%2CL2t%2CM8r%2CM9S%2CM%2Fa%2CPRT%2CPgi%2Csg7%2CF6%2CI2F%2CJTd%2CO-%2CC0v%2CC3f%2CDCq%2CDxn%2CERl%2CUbV%2CGME%2CGMG%2CGd2%2CGgO%2CGyK%2CG0h%2CG4F%2CG5p%2CJGL%2CJHJ%2CKhj%2CL4L%2CMM1%2CMYL%2CMYN%2CMp4%2CNL0%2COrj%2CvuW%2CBdE%2CBfv%2CI1i%2CBCh-%2COLA%2CHH4%2CM6O%2CM8Q%2CMre& 
__VIEWSTATE=%2FwEPDwUKLTkyMzI2ODA4Ng9kFgYCBA8WBB4EaHJlZgWJAWh0dHA6Ly93d3cuY3dqb2JzLmNvLnVrL0pvYlNlYXJjaC9SU1MuYXNweD9LZXl3b3Jkcz1QeXRob24mTFR4dD1Mb25kb24lMmMrU291dGgrRWFzdCZSYWRpdXM9MCZMSWRzMj1aViZjbGlkPTE2MjEmY2x0eXBlaWQ9MiZjbE5hbWU9TG9uZG9uHgV0aXRsZQUkTGF0ZXN0IFB5dGhvbiBqb2JzIGZyb20gQ1dKb2JzLmNvLnVrZAIGDxYCHgRUZXh0BV48bGluayByZWw9ImNhbm9uaWNhbCIgaHJlZj0iaHR0cDovL3d3dy5jd2pvYnMuY28udWsvSm9iU2Vla2luZy9QeXRob25fTG9uZG9uX2wxNjIxX3QyLmh0bWwiIC8%2BZAIIEGRkFg4CBw8WAh8CBV9Zb3VyIHNlYXJjaCBvbiA8Yj5LZXl3b3JkczogUHl0aG9uOyBMb2NhdGlvbjogTG9uZG9uLCBTb3V0aCBFYXN0OyA8L2I%2BIHJldHVybmVkIDxiPjg1PC9iPiBqb2JzLmQCCQ8WAh4HVmlzaWJsZWhkAgsPFgIfAgUoVGhlIG1vc3QgcmVsZXZhbnQgam9icyBhcmUgbGlzdGVkIGZpcnN0LmQCEw8PFgIeC05hdmlnYXRlVXJsBQF%2BZGQCFQ9kFgYCBQ8PFgYfAgUGUHl0aG9uHgtEZWZhdWx0VGV4dAUMZS5nLiBhbmFseXN0HhNEZWZhdWx0VGV4dENzc0NsYXNzZWRkAgsPDxYGHwIFEkxvbmRvbiwgU291dGggRWFzdB8FBQllLmcuIEJhdGgfBmVkZAIRDxAPFgYeDURhdGFUZXh0RmllbGQFClJhZGl1c05hbWUeDkRhdGFWYWx1ZUZpZWxkBQZSYWRpdXMeC18hRGF0YUJvdW5kZ2QQFREHMCBtaWxlcwcyIG1pbGVzBzUgbWlsZXMIMTAgbWlsZXMIMTUgbWlsZXMIMjAgbWlsZXMIMjUgbWlsZXMIMzAgbWlsZXMIMzUgbWlsZXMINDAgbWlsZXMINDUgbWlsZXMINTAgbWlsZXMINjAgbWlsZXMINzAgbWlsZXMIODAgbWlsZXMIOTAgbWlsZXMJMTAwIG1pbGVzFREBMAEyATUCMTACMTUCMjACMjUCMzACMzUCNDACNDUCNTACNjACNzACODACOTADMTAwFCsDEWdnZ2dnZ2dnZ2dnZ2dnZ2dnZGQCFw9kFgQCAQ9kFgQCBA8QZA8WA2YCAQICFgMQBQhBbGwgam9icwUBMGcQBRlEaXJlY3QgZW1wbG95ZXIgam9icyBvbmx5BQEyZxAFEEFnZW5jeSBqb2JzIG9ubHkFATFnZGQCBg8QZA8WA2YCAQICFgMQBQlSZWxldmFuY2UFATFnEAUERGF0ZQUBMmcQBQZTYWxhcnkFATNnZGQCBQ8PFgYeClBhZ2VOdW1iZXICAR4PTnVtYmVyT2ZSZXN1bHRzAlUeDlJlc3VsdHNQZXJQYWdlAhRkZAIZDxYCHwNoZGQ%3D& Refinesearch%24txtKeywords=Python& Refinesearch%24txtLocation=London%2CSouth+East& Refinesearch%24ddlRadius=0& Refinesearch%24btnSearch=Search& ddlCompanyType=0& ddlSort=1'
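The two dumps above are hard to compare by eye, so one option is to diff the POST bodies field by field. The sketch below is not from the original post: form_fields and diff_fields are hypothetical helper names, and the two bodies are abbreviated copies of the dumps with the long hdnSearchResults and __VIEWSTATE fields left out (those would need comparing as well).

```python
from urllib.parse import parse_qsl

# Abbreviated from the Firefox (working) dump; hdnSearchResults and
# __VIEWSTATE are omitted here only to keep the example short.
FIREFOX_BODY = (
    "__EVENTTARGET=srpPager%24btnForward&__EVENTARGUMENT="
    "&Refinesearch%24txtKeywords=Python"
    "&Refinesearch%24txtLocation=London%2C+South+East"
    "&Refinesearch%24ddlRadius=0&ddlCompanyType=0&ddlSort=1"
)

# Abbreviated from the mechanize (failing) dump, same omissions.
MECHANIZE_BODY = (
    "__EVENTTARGET=srpPager%24btnForward&__EVENTARGUMENT="
    "&Refinesearch%24txtKeywords=Python"
    "&Refinesearch%24txtLocation=London%2CSouth+East"
    "&Refinesearch%24ddlRadius=0&Refinesearch%24btnSearch=Search"
    "&ddlCompanyType=0&ddlSort=1"
)


def form_fields(body):
    """Decode a form-urlencoded body into a {name: value} dict."""
    # keep_blank_values=True so empty fields like __EVENTARGUMENT are kept.
    return dict(parse_qsl(body, keep_blank_values=True))


def diff_fields(good_body, bad_body):
    """Report fields present in only one body, or with different values."""
    good, bad = form_fields(good_body), form_fields(bad_body)
    shared = good.keys() & bad.keys()
    return {
        "only_in_good": sorted(good.keys() - bad.keys()),
        "only_in_bad": sorted(bad.keys() - good.keys()),
        "changed": sorted(k for k in shared if good[k] != bad[k]),
    }


result = diff_fields(FIREFOX_BODY, MECHANIZE_BODY)
print(result)
```

On these abbreviated bodies, the diff reports that the mechanize request carries an extra Refinesearch$btnSearch field and a Refinesearch$txtLocation value without the space after the comma. Whether either matters to the server is not established here; they are simply differences that could be ruled out alongside the cookie.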

A:

The SearchSession cookies are quite different: the working one has

SearchSession=SessionGuid=71de63de-3bd0-4787-895d-b6b9e7c93801

and the non-working one has

SearchSession=SessionGuid=33e4e439-c2d6-423f-900f-574099310d5a

Do you have any way to independently check why the server might not accept the second one? (The cookie may not be the cause, but since the server is complaining specifically about your SearchSession cookie, it seems like the first line of inquiry.)
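One way to start that check, sketched here as an illustration and not as part of the original answer, is to confirm that the SessionGuid a client sends back is the one the server issued in its Set-Cookie header. The helper names parse_cookie_header, session_guid, and sent_matches_issued are hypothetical; the header strings are copied from the dumps in the question, with the Firefox Cookie header trimmed to two cookies. The mechanize dump does not include the server's Set-Cookie, so the last comparison only demonstrates what a mismatch looks like.

```python
from urllib.parse import parse_qs

# Set-Cookie from the Firefox (working) response in the question.
FIREFOX_SET_COOKIE = (
    "SearchSession=SessionGuid=71de63de-3bd0-4787-895d-b6b9e7c93801"
    "&LogSource=NAT; path=/"
)
# Cookie header Firefox sent (trimmed to two of its cookies).
FIREFOX_COOKIE = (
    "ecos=774803468-0; "
    "SearchSession=SessionGuid=71de63de-3bd0-4787-895d-b6b9e7c93801"
    "&LogSource=NAT"
)
# Cookie header mechanize sent.
MECHANIZE_COOKIE = (
    "AnonymousUser=MemberId=8fa5ddd7-17ed-425e-b189-82693bfbaa0c"
    "&IsAnonymous=True; "
    "SearchSession=SessionGuid=33e4e439-c2d6-423f-900f-574099310d5a"
    "&LogSource=NAT"
)


def parse_cookie_header(header):
    """Split a Cookie or Set-Cookie header into a {name: value} dict."""
    cookies = {}
    for part in header.split(";"):
        # Split on the first '=' only: the values themselves contain '='.
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


def session_guid(header):
    """Return the SessionGuid inside the SearchSession cookie, or None."""
    value = parse_cookie_header(header).get("SearchSession")
    if value is None:
        return None
    # The cookie value is itself a key=value&key=value string.
    return parse_qs(value).get("SessionGuid", [None])[0]


def sent_matches_issued(set_cookie_header, cookie_header):
    """True if the client sent back the GUID that the server issued."""
    issued = session_guid(set_cookie_header)
    return issued is not None and issued == session_guid(cookie_header)


print(sent_matches_issued(FIREFOX_SET_COOKIE, FIREFOX_COOKIE))
print(sent_matches_issued(FIREFOX_SET_COOKIE, MECHANIZE_COOKIE))
```

To apply this to the failing case, the first argument would be the Set-Cookie header from the response mechanize received for the initial search. If the GUIDs match there, the cookie value itself is probably not what the server objects to.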

Alex Martelli