Web log file analysis software to measure search crawlers

After following a link to the first page of your site, the major search engine crawlers will first request a file called robots.txt, which tells the crawler which pages the site owner permits it to visit and which files or directories are off limits.
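To make that concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the example.com URLs, and the `build_parser` helper are all made up for illustration; real crawlers have their own implementations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (helper name is illustrative)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a (hypothetical) crawler may fetch each URL.
for url in (
    "https://example.com/index.html",
    "https://example.com/private/report.html",
):
    print(url, parser.can_fetch("ExampleBot", url))
```

With these rules, the first URL should be reported as fetchable and the second as off limits.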

What if you don't have a robots.txt? Nearly always, the crawler interprets this to mean that no pages or directories are off limits, and it will proceed to crawl your entire site. So why include a robots.txt file if that is what you want, i.e., for the crawler to index your entire site? Because if it's there, the crawler will nearly always request it so it can read it. That request shows up as a line in your server access log file, which is a pretty strong signature for a crawler.
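If you want to look for that signature yourself, a rough sketch along these lines could work. It assumes your server writes the common "combined" access log format; the regular expression and the `robots_txt_requesters` function name are my own, so adjust them to your actual log layout.

```python
import re
from collections import defaultdict

# Assumes the "combined" access log format; adjust if your server logs differently.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(lines):
    """Count /robots.txt requests per (ip, user agent) pair.

    Lines that do not match the assumed log format are skipped.
    """
    hits = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        # Ignore any query string when comparing the requested path.
        path = match.group("path").split("?", 1)[0]
        if path == "/robots.txt":
            hits[(match.group("ip"), match.group("agent"))] += 1
    return dict(hits)
```

You could feed it an open log file, e.g. `robots_txt_requesters(open("access.log"))`, where `access.log` stands in for wherever your server keeps its log. Any client that shows up in the result is a likely crawler, though not a confirmed one.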

Second, a good server access log parser, such as Webalizer or AWStats, compares user agents and IP addresses against published, authoritative lists. The IAB (http://www.iab.net/sites/spiders/login.php) and user-agents.org publish the two lists that seem to be the most widely used for this purpose. The former costs a few thousand dollars per year and up; the latter is free.
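The matching idea can be sketched in a few lines. This is not how Webalizer or AWStats do it internally; it is a simplified illustration that matches user agents only (no IP check, so a spoofed user agent would fool it), assumes the combined log format, and uses a tiny hand-written token table in place of a maintained list. It also tallies unique pages per crawler per top-level directory, which is one way to see how much of each area of a site has been crawled. All function and variable names here are my own.

```python
import re
from collections import defaultdict

# Illustrative tokens only; a real setup would load a maintained, published list.
CRAWLER_TOKENS = {"googlebot": "Google", "bingbot": "Bing", "slurp": "Yahoo"}

# Pulls the request path and user agent out of a combined-format log line.
REQUEST = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def classify_agent(agent):
    """Return the search engine name for a user agent string, or None."""
    lowered = agent.lower()
    for token, engine in CRAWLER_TOKENS.items():
        if token in lowered:
            return engine
    return None


def section_of(path):
    """Return the top-level directory of a path, or '/' for root-level pages."""
    trimmed = path.lstrip("/")
    if "/" in trimmed:
        return "/" + trimmed.split("/", 1)[0]
    return "/"


def unique_pages_by_crawler(lines):
    """Count unique pages requested per search engine per site section."""
    pages = defaultdict(lambda: defaultdict(set))
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue
        engine = classify_agent(match.group("agent"))
        if engine is None:
            continue
        path = match.group("path").split("?", 1)[0]  # drop query strings
        pages[engine][section_of(path)].add(path)
    return {
        engine: {section: len(paths) for section, paths in sections.items()}
        for engine, sections in pages.items()
    }
```

The result is a nested dictionary keyed by engine and then by section, so you can compare the counts against the number of pages you know exist in each area.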

Both Webalizer and AWStats can do what you want, though I recommend AWStats for the following reasons: it was updated fairly recently (approx. one year ago), while Webalizer was last updated over eight years ago. In addition, AWStats has much nicer report templates. The advantage of Webalizer is that it is much faster.

Here's sample output from AWStats (based on out-of-the-box config) that is probably what you are looking for:

[Image: portion of a sample AWStats search engine spiders report]

Thank you for your detailed response. I'm more interested in simply making sure our site gets fully crawled by the major search engines. To that end, I really need a tool that tells me how many unique pages are crawled in areas of our site by particular search engines. Can Webalizer do this?

at 2010-09-30 02:47:53

Apologies for the late reply; I just saw your comment. The answer to the question in your comment above is 'yes.' I substantially revised my answer in light of your comment; see the revised answer, which shows a portion of a sample Search Engine Spiders report from AWStats.

doug 2010-10-26 06:32:09
