I read some articles on web crawling and learned the basics. According to them, web crawlers just follow the URLs extracted from already-fetched pages, traversing a tree (practically a mesh).

In that case, how does a crawler ensure maximum coverage? Obviously there may be many sites that have no inbound links from other pages/sites. Do search engines follow any mechanisms other than crawling and manual registration (e.g. getting information from domain registries)?

If they are based purely on crawling, how should we select a good set of "root" sites to begin crawling? (We have no way to predict the results: if we select 100 sites with no outgoing links, the engine will come up with just those 100 sites plus their inner pages.)

+1  A: 

One method used to help crawlers is a "sitemap". The sitemap is basically a file that lists the contents of the website so that the crawler knows where to navigate, which is especially useful if your site has dynamic content. A more complete and accurate sitemap will greatly improve a crawler's coverage of the site.

Here's some info on the Google sitemap:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40318
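
As a concrete illustration, here is a minimal sketch of consuming a sitemap: fetch it over HTTP and pull out the `<loc>` entries defined by the sitemaps.org schema. The sitemap URL is a placeholder for illustration, not a real sitemap.

```python
# Minimal sketch: fetch a sitemap and list the URLs it declares.
# The sitemap URL below is a placeholder for illustration only.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Return the <loc> entries of a sitemap as a list of URL strings."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]

for url in sitemap_urls("http://www.example.com/sitemap.xml"):
    print(url)
```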

Andy White
For the sitemap standard itself (a simple convention that works alongside robots.txt), see http://www.sitemaps.org/
Yes, a sitemap is useful for traversing the inner pages of a given site. But how do we know the site's "home" in order to get the sitemap in the first place?
Chathuranga Chandrasekara
It should always be at the site root, named `sitemap.xml`: http://www.example.com/sitemap.xml
Tim McNamara
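
Tying the two comments together: the sitemaps.org protocol also lets robots.txt advertise sitemap locations via "Sitemap:" lines, so a crawler can check there first and fall back to the conventional root path. A small sketch, with error handling reduced to the bare minimum:

```python
# Sketch: discover a site's sitemap. robots.txt may advertise
# "Sitemap: <url>" lines (per sitemaps.org); failing that, fall
# back to the conventional sitemap.xml at the site root.
import urllib.request
from urllib.parse import urljoin

def find_sitemaps(site_root):
    """Return candidate sitemap URLs for a root like 'http://www.example.com/'."""
    sitemaps = []
    try:
        with urllib.request.urlopen(urljoin(site_root, "/robots.txt")) as resp:
            for line in resp.read().decode("utf-8", "replace").splitlines():
                if line.lower().startswith("sitemap:"):
                    sitemaps.append(line.split(":", 1)[1].strip())
    except OSError:
        pass  # no robots.txt, or it was unreadable; use the fallback
    return sitemaps or [urljoin(site_root, "/sitemap.xml")]

print(find_sitemaps("http://www.example.com/"))
```
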
+1  A: 

There is no magic mechanism that would let a crawler find a site that is neither linked from any already-crawled site nor manually added to the crawler.

The crawler only traverses the graph of links, starting from a set of manually registered (and therefore predefined) roots. Everything off that graph is unreachable to the crawler; it has no means of finding that content. The sketch below makes this concrete.
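
Here is a toy breadth-first crawl over an in-memory link graph (no live HTTP; the hostnames are made up). A page with no inbound links is simply never discovered:

```python
from collections import deque

# Toy link graph: each page maps to the pages it links to.
# "island.example" links out but has no inbound links, so no crawl
# starting elsewhere can ever reach it.
LINKS = {
    "root.example":   ["a.example", "b.example"],
    "a.example":      ["b.example"],
    "b.example":      ["root.example"],
    "island.example": ["a.example"],
}

def crawl(seeds):
    """Breadth-first traversal of the link graph from the given seeds."""
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen

print(crawl(["root.example"]))
# {'root.example', 'a.example', 'b.example'} -- island.example is never found
```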

sharptooth
"Everything that is off the graph will be unreachable to the crawler - it will have no means for finding this content." We can still have excellent sites but with no referrals or referrals from indexed sites.
Chathuranga Chandrasekara
An excellent site with no referrals from indexed sites is not an excellent site.
Emre
+2  A: 

Obviously there may be many sites that have no inbound links from other pages/sites.

I don't think this is really as big a problem as you think.

Do search engines follow any mechanisms other than crawling and manual registration (e.g. getting information from domain registries)?

None that I have heard of.

If they are based purely on crawling, how should we select a good set of "root" sites to begin crawling?

Any kind of general-purpose web directory, such as the Open Directory Project, would be an ideal candidate, as would social bookmarking sites like Digg or del.icio.us. A sketch of turning such a list into crawl seeds follows.
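
For example, a small sketch of loading a directory dump as crawl roots, assuming a hypothetical seeds.txt with one URL per line; the normalization step keeps obvious duplicates from inflating the frontier:

```python
# Sketch: build a deduplicated seed list ("roots") from a directory
# dump. "seeds.txt" is a hypothetical file, e.g. URLs exported from
# a web directory such as the Open Directory Project.
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase scheme/host and drop fragments so duplicates collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def load_seeds(path):
    """Read one URL per line, skipping blanks and '#' comments."""
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                seen.add(normalize(line))
    return sorted(seen)
```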

Michael Borgwardt