I need to run a web crawler, and I want to do it from EC2 because I want the HTTP requests to come from different IP ranges so I don't get blocked. Distributing the crawl across EC2 instances seems like it might help, but I can't find any information about what the outbound IP ranges will be. I don't want to take on the extra complexity of EC2 and distributing the data, only to find that all the instances use the same address block and I get blocked by the server anyway.

NOTE: This isn't for a DoS attack or anything. I'm trying to harvest data for a legitimate business purpose, I'm respecting robots.txt, and I'm only making one request per second, but the host is still shutting me down.
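For concreteness, here is a stripped-down sketch of the kind of polite loop I mean (robots.txt check plus a fixed one-second delay). The URLs and user-agent string are placeholders for illustration, not the actual crawler:

    import time
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleCrawler/0.1 (contact@example.com)"  # placeholder identity
    CRAWL_DELAY = 1.0  # one request per second

    # Load and obey the site's robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    def fetch(url):
        """Fetch a page only if robots.txt allows it."""
        if not robots.can_fetch(USER_AGENT, url):
            return None  # disallowed by robots.txt
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    for url in ("https://example.com/a", "https://example.com/b"):
        fetch(url)
        time.sleep(CRAWL_DELAY)  # throttle between requests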

Edit: Commenter Paul Dixon suggests that the act of blocking even my modest crawl indicates that the host doesn't want me to crawl them and therefore that I shouldn't do it (even assuming I can work around the blocking). Do people agree with this?

+3  A: 

First, the answer - yes, each EC2 instance gets its own IP address. Now on to some commentary:

  • It's easy for a site owner to block all requests from EC2-land, and some webmasters have started doing that due to the many poorly behaved bots running in EC2. So using EC2 might not be a long-term solution to your problem.

  • One request/second is still pretty fast. Super-polite is a crawl delay of 30 seconds; at Bixo Labs we usually run with a crawl delay of 15 seconds, and even 10 seconds starts causing problems at some sites.

  • You also need to worry about total requests/day, as some sites monitor that. A good rule of thumb is no more than 5,000 requests/day per IP address (see the sketch after this list).

  • Finally, using multiple servers in EC2 to get around rate-limiting means you're in the gray zone of web crawling, mostly inhabited by slimy characters harvesting email addresses, ripping off content, and generating splog. So consider carefully if you really want to be living in that neighborhood.
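For illustration, a rough sketch of the crawl-delay-plus-daily-budget policy described above; the class name and structure are invented for the example, and the numbers are just the ones quoted:

    import time

    class PoliteThrottle:
        """Enforce a fixed crawl delay and a per-day request budget."""

        def __init__(self, crawl_delay=15.0, max_requests_per_day=5000):
            self.crawl_delay = crawl_delay
            self.max_requests_per_day = max_requests_per_day
            self.day_started = time.time()
            self.requests_today = 0
            self.last_request = 0.0

        def wait_for_slot(self):
            now = time.time()
            # Reset the daily counter every 24 hours.
            if now - self.day_started >= 86400:
                self.day_started = now
                self.requests_today = 0
            # Refuse to go past the daily budget.
            if self.requests_today >= self.max_requests_per_day:
                raise RuntimeError("daily request budget exhausted")
            # Sleep until the crawl delay since the last request has passed.
            wait = self.crawl_delay - (now - self.last_request)
            if wait > 0:
                time.sleep(wait)
            self.last_request = time.time()
            self.requests_today += 1

Call wait_for_slot() before each fetch; bumping the delay to 30 seconds gives the super-polite behavior mentioned above.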

kkrugler
On the technical question, just so I understand, are these addresses going to be in the same subnet, so they're obviously related to each other? How different can you arrange for them to be? Regarding the ethical point, no, I don't want to be with the spammers, but on the other hand, I'm gathering this info for a service that my company provides, adding value for paying customers. If we had a large infrastructure, we could just distribute the crawl ourselves, but being small, I'm considering how else to accomplish it. There should be a way for small businesses to do legitimate crawling.
Joshua Frank
@Joshua EC2 uses different subnets, but all of them are identifiable: simply do a whois lookup on any IP address. Since most hosts/firewalls don't use whois data to protect themselves, you could try to find a subnet that wasn't blocked yet. I'd bet, though, that any such subnet will be blocked as soon as somebody notices a crawler coming from it.
sfussenegger
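As an aside, the whois check described in the comment above is easy to script; a rough heuristic sketch, shelling out to the standard whois command and scanning the response for Amazon's network names (output formats vary by registry, so treat this as approximate):

    import subprocess

    def looks_like_ec2(ip):
        """Heuristic: run `whois` on the address and look for Amazon's network names."""
        result = subprocess.run(["whois", ip], capture_output=True, text=True)
        text = result.stdout.lower()
        return "amazon" in text or "aws" in text or "ec2" in text

    print(looks_like_ec2("203.0.113.7"))  # documentation/example address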
I guess I can try this, but I think blocking the whole subnet is extreme, and it doesn't let the host distinguish between respectful crawlers making reasonable requests and the bad guys.
Joshua Frank
@Joshua: Webmasters only feel it necessary to let Google crawl their sites. Everyone else will be blocked the moment they start being at all obnoxious in their usage patterns. Yes, this isn't nice. No, they don't care.
Donal Fellows