My (Python) App Engine program fetches a web page from another site to scrape data from it, but the third-party site seems to be blocking requests from Google App Engine. I can fetch the page in development mode, but not when deployed.

Can I get around this by using a free proxy of some sort?

Can I use a free proxy to hide the fact that I am requesting from App Engine?

How do I find or choose a proxy? What do I need? How do I perform the fetch?

Is there anything else I need to know or watch out for?

+3  A: 

Probably the correct approach is to request permission from the owners of the site you are scraping.

Even if you use a proxy, there is still a big chance that requests coming through the proxy will end up blocked as well.

Daniel Vassallo
Ask permission? Well... I'd rather not (it's a torrent listing site). Besides, I don't think they would be able to unblock just my Google app without unblocking ALL Google apps. Maybe they block Google apps because other people (not me) write obnoxious bots that hit their servers too much. I would still like to try a proxy; I just don't know how to go about it. (Maybe I could run my own proxy at home for this single purpose? Hmm...)
Nick Perkins
How does it being a torrent site affect your ability to ask permission? Is there a policy on the site that you are breaking?
Peter Recore
+1  A: 

Have you considered changing the user-agent?

from google.appengine.api import urlfetch
result = urlfetch.fetch(u, headers={'User-Agent': 'Mozilla/5.0'}, allow_truncated=True)  # u is the target URL

The API will always append "AppEngine-Google;" to the user-agent, but this might work if the restriction is not based on an IP address range.

jbochi
Thanks for the idea, but this did not work ( in this case ).
Nick Perkins
A: 

Hi,

I'm currently having the same problem and was thinking about this solution (not yet tried):

-> develop an app that fetches what you want
-> run it locally
-> fetch your local server from your initial app

So the proxy is your own computer, which you know is not blocked.

Let me know if it works!

MrGoodFriend
Yes, that would work... but it sort of defeats the purpose of using App Engine (not having to run your own server). In the end, I just switched to another website (Pirate Bay) that does respond to requests from App Engine. (The result is http://nicksmovietorrents.appspot.com )
Nick Perkins
A: 

I'm currently having this same problem. Has anyone gotten this working?

MrGoodFriend, does your potential solution work? I am not exactly sure what you mean by "fetch your local server from your initial".

Andrew
A: 

Well to be fair, if they don't want you doing that then you probably shouldn't. It's not nice to be mean.

But if you really want to do it, the best approach would be creating a simple proxy script and running it on a VPS or some computer with a decent enough connection.

Basically, you expose a REST API on your server for your GAE app to call; the server then forwards each request it receives to the target site and returns the response.
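A minimal sketch of such a relay, assuming plain Python 3 with only the standard library on the VPS. The `/fetch?url=...` endpoint shape, the port, the `ALLOWED_HOSTS` whitelist, and the names `target_from_path` and `RelayHandler` are all made up for illustration; `example.com` stands in for the real target site.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request, urlopen

# Hypothetical whitelist: only relay to the site you actually scrape,
# so the server does not become an open proxy.
ALLOWED_HOSTS = {"example.com"}


def target_from_path(path):
    """Return the validated ?url=... target of a relay request, or None."""
    urls = parse_qs(urlsplit(path).query).get("url")
    if not urls:
        return None
    target = urls[0]
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.hostname not in ALLOWED_HOSTS:
        return None
    return target


class RelayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = target_from_path(self.path)
        if target is None:
            self.send_error(400, "missing or disallowed url parameter")
            return
        try:
            # Fetch the page from this machine, whose IP is not blocked.
            req = Request(target, headers={"User-Agent": "Mozilla/5.0"})
            with urlopen(req, timeout=10) as upstream:
                body = upstream.read()
                status = upstream.status
                content_type = upstream.headers.get(
                    "Content-Type", "application/octet-stream")
        except OSError:
            # URLError and HTTPError are both subclasses of OSError.
            self.send_error(502, "upstream fetch failed")
            return
        # Hand the upstream response back to the caller (the GAE app).
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RelayHandler).serve_forever()
```

The App Engine app would then call urlfetch.fetch on the relay's address, passing the percent-encoded target as the url parameter, instead of fetching the target site directly. You would probably also want some kind of shared secret between the two, which this sketch leaves out.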

Swizec Teller
A: 

What you are talking about is a valid bug in the App Engine SDK. Have a look at http://code.google.com/p/googleappengine/issues/detail?id=544 for bug updates and workarounds for Java and Python.

pranny