If I have a forum site with a large number of threads, will the search engine bot crawl the whole site every time? Say I have over 1,000,000 threads on my site; will they all get crawled every time the bot visits? Or how does it work? I want my website to be indexed, but I don't want the bot to kill it! In other words, I don't want the bot to keep crawling the old threads again and again every time it crawls my website.

Also, what about pages crawled before? Will the bot request them every time it crawls my website to make sure they are still on the site? I ask because I only link to the latest threads: there's a page that lists all the latest threads, but I don't link to the older ones, so they have to be requested explicitly by URL, e.g. http://example.com/showthread.aspx?threadid=7. Will this stop the bot from bringing my site down and consuming all my bandwidth?

P.S. The site is still under development, but I want to know now so I can design it in a way that search engine bots won't bring it down.

+7  A: 

Complicated stuff.

From my experience, the URL scheme you use to link pages together largely determines which pages a crawler will visit.

  • Most engines crawl the entire website if it is properly hyperlinked with crawl-friendly URLs, e.g. rewritten URLs instead of topicID=123 query strings, and every page is reachable within a few clicks from the main page.

  • Another case is paging: with paged listings, the bot sometimes crawls just the first page and stops when it finds that the next-page link keeps hitting the same document, e.g. one index.php for the entire website.

  • You wouldn't want a bot to accidentally hit a page that performs an action, e.g. a "Delete topic" link pointing to "delete.php?topicID=123", so most crawlers check for those cases as well.

  • The Tools page at SEOmoz also provides a lot of information and insight into how some crawlers work and what information they extract and chew on, etc. You could use those tools to determine whether pages deep inside your forum, e.g. a year-old post, are likely to get crawled or not.

  • And some crawlers let you customize their crawling behavior... something like Google Sitemaps. You can tell them which pages to crawl and which to skip, and in what order, etc. I remember similar services are available from MSN and Yahoo as well, but I've never tried them myself.

  • You can throttle the crawling bot so it doesn't overwhelm your website by providing a robots.txt file in the website root; see the sketch after this list.
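For instance, a rough robots.txt sketch covering both the throttling and the "don't crawl action URLs" points above (the delete.php path is just the hypothetical example from earlier, and note that Crawl-delay is a non-standard directive honored by Yahoo and MSN but ignored by Google, where you set the crawl rate through Webmaster Tools instead):

    # Keep well-behaved bots away from action URLs and ask
    # them to pause between requests.
    User-agent: *
    Disallow: /delete.php
    Crawl-delay: 10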

Basically, if you design your forum so that the URLs don't look hostile to crawlers, they'll merrily crawl the entire website.

chakrit
A: 

To build on what chakrit said, some search engines (Google in particular) will only index pages that have one or two query string parameters. Beyond that, the page is generally ignored, probably because it's seen as too dynamic and therefore an unreliable URL.

It's best to create SEO-friendly URLs that are devoid of parameters, instead hiding the implementation behind something like mod_rewrite in Apache or routes in Rails (e.g. http://domain.com/forum/post/123 maps to http://domain.com/forum/post.php?id=123).
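As a sketch, assuming the hypothetical URLs above and an Apache setup with mod_rewrite enabled, the rule in an .htaccess file might look like this:

    # Serve the clean URL /forum/post/123 from the real
    # script /forum/post.php?id=123 without redirecting.
    RewriteEngine On
    RewriteRule ^forum/post/([0-9]+)$ /forum/post.php?id=$1 [L]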

Chakrit also mentioned Google Sitemaps. These are useful for ensuring Google scans every posting and keeps it in its index. Jeff Atwood discusses this on Stack Overflow podcast 24, where he explains that Google wasn't keeping all the Stack Overflow posts until they put each one inside the sitemap.
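A sitemap is just an XML file following the sitemaps.org protocol. A minimal sketch, using the question's example URL as a placeholder (large sites would split entries across several files referenced from a sitemap index):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per thread; the optional <lastmod>
           helps bots skip unchanged pages. -->
      <url>
        <loc>http://example.com/showthread.aspx?threadid=7</loc>
        <lastmod>2009-01-01</lastmod>
      </url>
    </urlset>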

Michael Glenn
A: 

Crawling bots don't crawl your whole site at once; they fetch some pages on each visit. The frequency of the visits and the number of pages crawled each time vary greatly from site to site.

Each page indexed by Google is re-crawled once in a while to pick up any changes.

Using a sitemap definitely helps make sure the search engines index as many pages as possible.

allesklar