I have a DB with user account information.
I've scheduled a cron job that updates the DB with any new user data it fetches from those accounts. I suspect this may cause a problem: all requests come from the same IP address, so the server may block requests from that address.

Is this the case?
If so, how do I avoid being banned? Should I be using a proxy?

Thanks

A: 

Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?

If this is the case, you may want to set up a few different cron jobs and do the work in batches. That way you reduce the load on the remote server and lower the chance that the site you are getting this data from blocks your access.
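A minimal sketch of the batching idea, assuming each cron job is started with its own slot number. The function name `users_for_slot` and the slot scheme are hypothetical, just one way to split the work:

```python
def users_for_slot(user_ids, slot, num_slots):
    """Return the users one cron run (slot) should update, out of num_slots runs."""
    if not 0 <= slot < num_slots:
        raise ValueError("slot must be in range(num_slots)")
    # Sort first so every run agrees on the same assignment of users to slots.
    ordered = sorted(user_ids)
    return [uid for index, uid in enumerate(ordered) if index % num_slots == slot]
```

For example, with four cron entries spaced an hour apart, each one would call `users_for_slot(all_ids, n, 4)` with its own `n` from 0 to 3, so no single run hits the remote site for every user at once.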

Edit

Okay, so if you haven't got permission to do the scraping, you are obviously going to want to do it responsibly (no matter the site). Try to gather as much data as you can from as few requests as possible, and spread them out over the course of the whole day, or schedule them for times that are likely to be low load. I wouldn't try to use a proxy; that wouldn't really help the remote server, and it would be a pain in the ass for you.
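Spreading the requests out could look roughly like this. It's only a sketch: the name `polite_fetch_all`, the delay values, and the `fetch` callback are assumptions, and the right pacing depends on the site.

```python
import random
import time


def polite_fetch_all(items, fetch, min_delay=5.0, jitter=2.0,
                     sleep=time.sleep, rng=random.random):
    """Call fetch(item) for each item, pausing between consecutive calls.

    Each pause lasts min_delay plus a random fraction of jitter (in seconds),
    so requests are not fired back to back at the remote server.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # No pause before the first request, only between requests.
            sleep(min_delay + jitter * rng())
        results.append(fetch(item))
    return results
```

The `sleep` and `rng` parameters are there only so the pacing can be tested without actually waiting; in the cron job you would leave them at their defaults.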

I'm no iPhone programmer, and this might not be possible, but you could try having the individual iPhones grab the data so that the traffic doesn't all come from the same IP. Just an idea; otherwise just try to be a bit discreet.

Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.

  1. Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.

  2. Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."

  3. Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??

  4. Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.

  5. Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
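Tips 1 and 2 can be sketched with Python's standard library as follows. The user-agent string and its URL are made-up placeholders, and `build_request`, `decode_body`, and `fetch` are hypothetical helper names, not anything the remote site requires:

```python
import gzip
import urllib.request

# Hypothetical identifier; point the URL at a page that describes your bot.
USER_AGENT = "ExampleAccountSync/0.1 (+https://example.com/bot-info)"


def build_request(url):
    """Build a request that identifies the client and asks for a gzip response."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT, "Accept-Encoding": "gzip"},
    )


def decode_body(raw, content_encoding):
    """Decompress the body only if the server says it is gzip-encoded."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url):
    """Fetch url and return the decoded body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

`urllib` does not decompress responses on its own, which is why `decode_body` checks the `Content-Encoding` header before returning the data.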

Sam152
I created an iPhone app that monitors user account data. Therefore, it needs to log in to the remote site and get the new data. Right now I don't have any approval from the remote site.
My issue is on the server side. My DB needs to be updated once every 4 hours, which means that some cron jobs will need to update all users, and that will be done from the same IP, the server's IP. I would like to know how other services handle this kind of task.
A: 

You get banned for suspicious (or malicious) activity.

If you are running a normal business application inside a normal company intranet you are unlikely to get banned.

Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.

should I be using a proxy?

A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.

richj
I think you misunderstood me. I created an iPhone app that monitors user account data. The cron jobs log in to the remote site and fetch the new data available for each user. I just need to avoid getting banned.