What should I do when I see a single IP in my logs crawling through hundreds of pages on my site? I have a WordPress blog, and it doesn't look like a real person. This happens almost daily, with a different IP each time.

UPDATE: Oh, I forgot to mention, I'm pretty sure it's not a search engine spider. The hostname doesn't belong to a search engine; it looks like some random person in India (it ends in '.in'). What I'm concerned about is this: if it is a scraper, is there anything I can do? Or could it be something worse than a scraper, e.g. a hacker?

A: 

Probably some script kiddie looking to take advantage of an exploit in your blog (or server). That, or some web crawler.

Jeff Moser
+4  A: 

It's a spider/crawler. Search engines use these to compile their listings, researchers use them to figure out the structure of the internet, the Internet Archive uses them to download the contents of the Internet for future generations, spammers use them to search for e-mail addresses, and there are many other uses.

Checking out the user agent string in your logs may give you more information on what they're doing. Well-behaved bots will generally indicate who/what they are - Google's search bots, for example, are called Googlebot.
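As a rough sketch of how to pull the user agent out of the logs, the Python below tallies user agent strings per IP. It assumes the Apache/nginx "combined" log format; the `LOG_LINE` pattern, the `agents_by_ip` helper, and the sample lines are made up for illustration, so adjust them to whatever your server actually writes. Keep in mind the user agent is supplied by the client and can be faked.

```python
import re
from collections import Counter, defaultdict

# Assumed: Apache/nginx "combined" log format. Adjust if your server differs.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def agents_by_ip(lines):
    """Map each client IP to a Counter of the user agent strings it sent."""
    result = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that don't fit the assumed format
            result[match.group("ip")][match.group("agent")] += 1
    return result

# Hypothetical sample lines (documentation IP ranges, invented bot name).
sample = [
    '203.0.113.7 - - [10/Oct/2009:13:55:36 +0000] "GET /page/1 HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [10/Oct/2009:13:55:37 +0000] "GET /page/2 HTTP/1.1" 200 2100 "-" "ExampleBot/1.0"',
    '198.51.100.4 - - [10/Oct/2009:13:56:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for ip, agents in agents_by_ip(sample).items():
    print(ip, dict(agents))
```

An IP that sends hundreds of requests with one bot-like user agent (or an empty one) is a good candidate for a crawler.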

ceejayoz
OK, so the hostnames are not from googlebot.com or anything similar; I'm pretty sure it's not a spider.
chris
That doesn't mean it's not a spider. There are plenty of obscure spiders out there. Also, user agent and hostname are different things.
ceejayoz
`wget` makes it rather easy to recursively grab all link-accessible pages, with control over content types (pages + images, pages only, etc.), recursion depth, whether to follow links to other hosts, and so on. So it *could* be a human being using wget or something similar, but yeah, agreed that it's probably a more automatic spider.
Jonathan Fingland
From a security standpoint, just in case it isn't a spider, is there something I can or should do? Thanks.
chris
I could run a spider from my personal IP address at home. There's FOSS that will index a slice of the internet for you, or I could write my own spider software. Also, you say you're afraid that they're screen scraping; that's not far from spidering. Spiders are basically well-organized screen scrapers.
belgariontheking
A: 

It's probably a spider-bot indexing your site. The "User-Agent" might give it away. A dynamically generated WordPress site can easily produce hundreds of GET requests if the count isn't all blog pages but also includes things like CSS, JS and images.
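To see how much of the traffic is actual pages versus supporting files, something like the following Python sketch could help. The extension list and the `is_static_asset` / `split_counts` helpers are assumptions for illustration, not anything WordPress defines.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Assumed list of "supporting file" extensions; extend it for your site.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico"}

def is_static_asset(request_path):
    """True if the path (ignoring any query string) ends in a static extension."""
    path = urlsplit(request_path).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in STATIC_EXTENSIONS

def split_counts(request_paths):
    """Count how many requested paths are pages and how many are assets."""
    counts = Counter()
    for request_path in request_paths:
        counts["asset" if is_static_asset(request_path) else "page"] += 1
    return counts

# Hypothetical paths requested by one IP.
paths = [
    "/",
    "/2009/05/some-post/",
    "/wp-content/themes/x/style.css?ver=2.8",
    "/wp-includes/js/jquery/jquery.js",
    "/wp-content/uploads/pic.PNG",
]
print(split_counts(paths))
```

If most of an IP's requests are assets, a few real page views from a browser could explain the volume; hundreds of distinct pages with no assets at all looks more like a crawler.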

sybreon
+2  A: 

If you're concerned about script kiddies, I suggest checking your error logs. The scripts often look for things you may not have. For example, on one system I run, I don't have ASP, yet I can tell when a script kiddie has probed the site because I see lots of attempts to find ASP pages in my error logs.
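A minimal Python sketch of that check: it flags IPs with repeated 404s for paths ending in `.asp`. It assumes you have already parsed your logs into `(ip, path, status)` tuples; the `probe_suspects` name, the threshold of 3, and the sample records are all hypothetical.

```python
from collections import Counter

def probe_suspects(requests, suffix=".asp", threshold=3):
    """Return {ip: count} for IPs with at least `threshold` 404s on `suffix` paths.

    `requests` is an iterable of (ip, path, status) tuples, assumed to be
    parsed from your own logs beforehand.
    """
    misses = Counter()
    for ip, path, status in requests:
        path_only = path.split("?", 1)[0].lower()  # drop any query string
        if status == 404 and path_only.endswith(suffix):
            misses[ip] += 1
    return {ip: count for ip, count in misses.items() if count >= threshold}

# Hypothetical parsed records.
records = [
    ("198.51.100.9", "/admin.asp", 404),
    ("198.51.100.9", "/management.asp", 404),
    ("198.51.100.9", "/login.asp?user=a", 404),
    ("203.0.113.7", "/missing.html", 404),
    ("203.0.113.7", "/2009/05/some-post/", 200),
]
print(probe_suspects(records))
```

Set `suffix` to whatever technology your server does not run, since requests for it can only be blind probing.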

PTBNL
How do you know it is an attempt to find an ASP page?
chris
As in, the person requests lots of pages such as admin.asp, management.asp, login.asp, et cetera. They just try to find any hole they can as quickly as possible, rather than carefully analyzing.
Paul Fisher
@Chris: Paul's answer matches my experience.
PTBNL