I read some articles on web crawling and learned the basics. According to them, web crawlers simply use the URLs extracted from already-fetched pages and traverse the resulting link structure, which is really a graph (a mesh) rather than a tree.
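For reference, here is roughly how I picture the crawl loop after reading those articles (a minimal sketch only; the seed URL, page limit, and timeout are placeholder values I chose for illustration):

```python
# Minimal sketch of my understanding: start from seed ("root") URLs,
# fetch each page, extract its links, and follow them breadth-first.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # avoid revisiting pages (the graph/"mesh" problem)
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue          # skip unreachable or non-HTML pages
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

# Only pages reachable from the seeds are ever discovered.
print(crawl(["https://example.com/"], max_pages=20))
```

If this mental model is right, then everything the crawler can find is limited to what is reachable from the seed set, which leads to my questions below.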
In that case, how does a crawler ensure maximum coverage? Obviously there may be many sites that have no inbound links from other pages or sites. Do search engines use any mechanisms other than crawling and manual registration, such as getting information from domain registries?
If they are based purely on crawling, how should we select a good set of "root" (seed) sites to begin crawling from? We have no way to predict the results: if we select 100 sites that have no referral links to other sites, the engine will end up with just those 100 sites plus their inner pages.