How does a crawler or spider in a search engine work?
A:
From How Stuff Works
How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
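A rough sketch of that traversal in Python (standard library only; the seed list, the page limit, and the index_words stub are placeholders for illustration, not part of any real engine):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

SEEDS = ["https://example.com/"]      # hypothetical list of popular starting pages

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_words(url, html):
    # stand-in for the real indexer: just note the page and its size
    print(url, len(html))

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)           # pages waiting to be visited
    visited = set()                   # pages already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                  # skip unreachable pages
        index_words(url, html)        # index the words on this page
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            frontier.append(urljoin(url, link))   # follow every link found

crawl(SEEDS)
```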
aioobe
2010-05-05 11:38:21
A:
Specifically, you need at least some of the following components:
- Configuration: Needed to tell the crawler how, when, and where to connect to documents, and how to connect to the underlying database/indexing system.
- Connector: This will create the connections to a web page or a disk share or anything, really.
- Memory: The pages already visited must be known to the crawler. This is usually stored in the index, but it depends on the implementation and the needs. The content is also hashed for de-duplication and update-validation purposes (a hashing sketch follows this list).
- Parser/Converter: Needed to understand the content of a document and extract meta-data. Will convert the extracted data into a format usable by the underlying database system.
- Indexer: Will push the data and meta-data to a database/indexing system.
- Scheduler: Will plan runs of the crawler. Might need to handle a large number of crawlers running at the same time and take into account what is currently being done (a politeness-scheduling sketch follows this list).
- Connection algorithm: When the parser finds links to other documents, the crawler needs to decide when, how, and where the next connections must be made. Also, some indexing algorithms take the page link graph into account, so it might be necessary to store and sort information related to that.
- Policy Management: Some sites require crawlers to respect certain policies (robots.txt, for example); a robots.txt check is sketched after this list.
- Security/User Management: The crawler might need to be able to log in to some systems to access data.
- Content compilation/execution: The crawler might need to execute or render certain content (applets, plugins, scripts) to reach what is inside.
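A minimal sketch of the Memory point above, assuming plain in-memory dicts stand in for whatever store a real crawler would use:

```python
import hashlib

seen_hashes = {}        # content hash -> first URL seen with that content
last_hash_for_url = {}  # URL -> content hash from the previous visit

def remember(url, content: bytes):
    """Record a fetched page; report whether it is a duplicate or has changed."""
    digest = hashlib.sha256(content).hexdigest()
    duplicate_of = seen_hashes.get(digest)           # same bytes seen at another URL?
    changed = last_hash_for_url.get(url) != digest   # different from the last crawl?
    seen_hashes.setdefault(digest, url)
    last_hash_for_url[url] = digest
    return duplicate_of, changed
```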
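For the Scheduler and Connection-algorithm points, one common trick is to key a priority queue on the earliest time each host may be hit again. The delay value and the heap-based queue below are illustrative choices, not a prescribed design:

```python
import heapq
import time
from urllib.parse import urlparse

CRAWL_DELAY = 5.0        # assumed seconds between hits on the same host

queue = []               # min-heap of (earliest_fetch_time, url)
next_allowed = {}        # host -> earliest time we may hit it again

def schedule(url):
    host = urlparse(url).netloc
    when = max(time.time(), next_allowed.get(host, 0.0))
    heapq.heappush(queue, (when, url))
    next_allowed[host] = when + CRAWL_DELAY

def next_url():
    when, url = heapq.heappop(queue)
    time.sleep(max(0.0, when - time.time()))   # wait until this slot is due
    return url
```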
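And for policy management, Python's standard urllib.robotparser handles the robots.txt check; the user-agent string here is made up:

```python
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "example-crawler"      # hypothetical crawler name

def allowed(url):
    """Return True if the site's robots.txt permits fetching this URL."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = robotparser.RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()                   # fetch and parse the robots.txt file
    # in practice you would cache one parser per host instead of re-reading
    return parser.can_fetch(USER_AGENT, url)
```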
Crawlers need to be efficient: working together from different starting points, keeping speed and memory usage under control, and running a high number of threads/processes. I/O is key.
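Since fetching is I/O-bound, a thread pool (or async tasks) keeps many connections in flight at once. A small sketch with the standard library, using a placeholder URL list:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    try:
        return url, urlopen(url, timeout=10).read()
    except OSError:
        return url, None

urls = ["https://example.com/", "https://example.org/"]   # placeholder frontier

# Threads work well here because the workers spend most of their time
# waiting on the network, not on the CPU.
with ThreadPoolExecutor(max_workers=32) as pool:
    for url, body in pool.map(fetch, urls):
        pass   # hand each fetched page to the parser/indexer
```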
Loki
2010-05-19 12:09:51