I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university.
The spider will look up about 17,000 values through Blogger's browse function and (anonymously) save those that meet certain criteria.
I've been running the spider (written in PHP) and it works fine, but I don't want to get my IP blacklisted or anything like that. Does anyone know what restrictions enterprise sites place on this kind of automated access?
Furthermore, if there are restrictions in place, is there anything I can do to circumvent them? At the moment, all I can think of that might help slightly is adding a random delay between calls to the site (between 0 and 5 seconds) or running the script through random proxies to disguise the requests.
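To make the delay idea concrete, here is a rough sketch of what I mean. It is in Python rather than PHP purely for illustration, and the names (`fetch_with_delay`, `fetch`, `min_delay`, `max_delay`) are made up for this example rather than taken from my actual spider. The 0 to 5 second range is the one mentioned above; I have no idea whether that is slow enough to stay under whatever limits the site may have.

```python
import random
import time


def fetch_with_delay(urls, fetch, min_delay=0.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is a placeholder for whatever function actually performs the
    HTTP request; `sleep` is injectable so the pause can be stubbed out.
    """
    results = []
    for index, url in enumerate(urls):
        # No need to wait before the very first request.
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With the real spider, `fetch` would be the function that downloads and parses one browse page, and `urls` would be the list of roughly 17,000 lookups.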
Having to resort to methods like those above makes me feel as if I'm doing the wrong thing. I would be annoyed if they were to block me for whatever reason, because blogger.com is owned by Google and their main product is a web spider. Albeit, their spider does not send all of its requests to just one website.