Okay, so an exhaustive depth-first crawl that visits every link isn't efficient. I'm looking for a library or algorithm that can make crawling more efficient by focusing on relevant pages, i.e., skipping repetitive pages or pages with little unique content.
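For context, the kind of approach I have in mind is a "focused" or best-first crawl: score each page for relevance, use that score to prioritize which links to follow next, and fingerprint page content (e.g., with SimHash) to skip near-duplicates. Here is a rough sketch of that idea; the `fetch` and `extract_links` callables, the keyword-overlap relevance score, and the `dup_threshold` value are all placeholders, not any particular library's API:

```python
import hashlib
import heapq

def simhash(text, bits=64):
    """64-bit SimHash: near-duplicate texts get fingerprints that differ in few bits."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def relevance(text, keywords):
    """Crude relevance score: fraction of query keywords present in the page."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def focused_crawl(seed_urls, fetch, extract_links, keywords,
                  max_pages=100, dup_threshold=3):
    """Best-first crawl: always expand the most promising URL next,
    and drop pages whose SimHash is within dup_threshold bits of a kept page."""
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen_urls, fingerprints, results = set(seed_urls), [], []
    while frontier and len(results) < max_pages:
        _neg_score, url = heapq.heappop(frontier)
        text = fetch(url)
        fp = simhash(text)
        if any(hamming(fp, f) <= dup_threshold for f in fingerprints):
            continue  # near-duplicate of something already crawled
        fingerprints.append(fp)
        results.append(url)
        score = relevance(text, keywords)
        for link in extract_links(url, text):
            if link not in seen_urls:
                seen_urls.add(link)
                # Use the parent's relevance as the child's priority
                heapq.heappush(frontier, (-score, link))
    return results
```

On a toy in-memory "web" this skips an exact-duplicate page while still reaching distinct ones; a real crawler would also need politeness delays, robots.txt handling, and a smarter relevance model (e.g., a trained classifier rather than keyword overlap).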