web-crawler

What are the key considerations when creating a web crawler?

I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I've found seem old and awkward, so I'd like to get some current (and practical) insights from the web developer community. I want to use a crawler to walk over "the web...
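
A minimal sketch of the etiquette basics that usually come up in answers here: identify yourself with a User-Agent, honor robots.txt, and pause between requests. The site, agent string, and delay below are placeholders, not recommended values.

    import time
    import urllib.request
    import urllib.robotparser

    SITE = "http://example.com"                      # hypothetical starting site
    USER_AGENT = "MyCrawler/0.1 (+http://example.com/crawler-info)"

    rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    rp.read()

    def polite_fetch(url):
        # skip anything robots.txt forbids for our user agent
        if not rp.can_fetch(USER_AGENT, url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            html = resp.read()
        time.sleep(2)  # arbitrary pause so we don't hammer the server
        return html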

How to set up a robots.txt which only allows the default page of a site

Say I have a site on http://website.com. I would really like to allow bots to see the home page, but every other page needs to be blocked, as it is pointless to spider them. In other words, http://website.com & http://website.com/ should be allowed, but http://website.com/anything and http://website.com/someendpoint.ashx should be blocked. Further...
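
For reference, the way this usually gets answered, using the asker's hypothetical host: disallow everything, then allow just the root. Note that Allow and the $ end-anchor are extensions beyond the original robots.txt spec, supported by the major engines but not guaranteed for every bot.

    User-agent: *
    Disallow: /
    Allow: /$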

Keep rsync from removing unfinished source files

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run: $ rsync --remove-source-files speed:/var/crawldir . but I worry that...
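
One workaround that comes up for this, assuming you can control where the crawler writes: have it download into a scratch directory on the same filesystem and only mv (an atomic rename) finished files into the directory rsync reads from, so --remove-source-files never sees a partial file. The paths and the wget stand-in for the crawler are placeholders.

    # crawler side: write in progress elsewhere, move into place only when complete
    wget -q -O /var/crawltmp/"$NAME" "$URL" && mv /var/crawltmp/"$NAME" /var/crawldir/"$NAME"

    # mass side: only complete files ever appear in /var/crawldir on speed
    rsync -av --remove-source-files speed:/var/crawldir/ /big/disk/crawl/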

How do you turn a dynamic site into a static site that can be demo'd from a CD?

I need to find a way to crawl one of our company's web applications and create a static site from it that can be burned to a CD and used by traveling salespeople to demo the web site. The back-end data store is spread across many, many systems, so simply running the site on a VM on the salesperson's laptop won't work. And they won't have...
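
One approach that often gets suggested for this is wget's mirroring mode, which rewrites links so the copy can be browsed straight from a folder or CD; the URL is a placeholder, and anything generated client-side by JavaScript won't be captured.

    wget --mirror --convert-links --html-extension --page-requisites \
         --no-parent http://internal-app.example.com/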

Is there a .NET equivalent of Perl's LWP / WWW::Mechanize?

After working with .NET's HttpWebRequest/Response objects, I'd rather shoot myself than use them to crawl through web sites. I'm looking for an existing .NET library that can fetch URLs and give you the ability to follow links, extract/fill in/submit forms on the page, etc. Perl's LWP and WWW::Mechanize modules do this very well, but ...

Building a web crawler - using Webkit packages

I'm trying to build a web crawler. I need two things: convert the HTML into a DOM object, and execute existing JavaScript on demand. The result I expect is a DOM object where the JavaScript that runs on load has already been executed. I also need an option to execute additional JavaScript on demand (on events like onMouseOver, onMouseClick...
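
Not a WebKit package, but one alternative that covers both requirements is driving a real browser with Selenium: the page's onload scripts run before you read the DOM, and you can execute extra JavaScript whenever you like. The URL and script are placeholders.

    from selenium import webdriver   # assumes Selenium and a browser driver are installed

    driver = webdriver.Firefox()
    driver.get("http://example.com/page")       # onload JavaScript runs here
    dom_html = driver.page_source               # serialized DOM after scripts ran
    title = driver.execute_script("return document.title")  # run JS on demand
    driver.quit()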

What's a good Web Crawler tool

I need to index a whole lot of webpages; what good web crawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper. What I really need is something I can give a site URL to, and it will follow every link and store the content for indexing. ...

Prevent site data from being crawled and ripped

I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search. What measures can I take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried about SEO, although I wouldn't want to block legitimate crawlers altogether. For example,...
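
Most answers to this boil down to making bulk access slow and inconvenient, since anything a browser can render can ultimately be scraped. A rough, framework-agnostic sketch of per-IP rate limiting (the window and threshold are made-up numbers):

    import time
    from collections import defaultdict, deque

    WINDOW = 60        # seconds
    MAX_HITS = 120     # arbitrary example: requests allowed per IP per window
    hits = defaultdict(deque)

    def allow_request(ip):
        now = time.time()
        q = hits[ip]
        while q and now - q[0] > WINDOW:
            q.popleft()              # forget hits outside the window
        if len(q) >= MAX_HITS:
            return False             # likely a ripper: throttle or show a CAPTCHA
        q.append(now)
        return True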

HttpBrowserCapabilities.Crawler property .NET

How does the HttpBrowserCapabilities.Crawler property (http://msdn.microsoft.com/en-us/library/aa332775(VS.71).aspx) work? I need to detect a partner's custom crawler and this property is returning false. Where/How can I add his user agent so that this property will return true? Any other way outside of creating my own user agent de...
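
The usual fix is a browser definition file under App_Browsers that matches the partner's user-agent string and sets the crawler capability; the id and match value below are placeholders for the real agent.

    <!-- App_Browsers/PartnerCrawler.browser (hypothetical file and names) -->
    <browsers>
      <browser id="PartnerCrawler" parentID="Mozilla">
        <identification>
          <userAgent match="PartnerBot" />
        </identification>
        <capabilities>
          <capability name="crawler" value="true" />
        </capabilities>
      </browser>
    </browsers>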

Can I block search crawlers for every site on an Apache web server?

I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites not get indexed. Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers? Changing the robots.txt wouldn't really work since I use scrip...
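
Two server-wide options that usually come up: alias every site's /robots.txt to one deny-all file, or have Apache send an X-Robots-Tag header so well-behaved engines skip indexing regardless of robots.txt. Paths are placeholders, and depending on your setup the directives may need to be repeated inside each VirtualHost.

    # httpd.conf on the staging server
    Alias /robots.txt /var/www/norobots.txt       # one deny-all robots.txt for every site
    Header set X-Robots-Tag "noindex, nofollow"   # requires mod_headers

    # /var/www/norobots.txt
    User-agent: *
    Disallow: /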

Detecting 'stealth' web-crawlers

What options are there to detect web-crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.) I'm not talking about t...
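
One of the standard "catch the ones that make mistakes" tricks is a trap link: a URL hidden from humans (e.g. via CSS) and disallowed in robots.txt, so only a crawler ignoring both ever requests it. A rough, framework-agnostic sketch of the server side, with made-up names:

    flagged_ips = set()

    def is_stealth_crawler(path, ip):
        # /trap/ is a hypothetical hidden, robots.txt-disallowed URL
        if path.startswith("/trap/"):
            flagged_ips.add(ip)       # whoever fetched it is assumed to be a bot
        return ip in flagged_ips      # True: throttle, block, or serve decoy content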

Recommendations for a spidering tool to use with Lucene or Solr?

What is a good crawler (spider) to use against HTML and XML documents (local or web-based) that works well in the Lucene / Solr solution space? It could be Java-based, but it doesn't have to be. ...

Protect Email on Web Site From Robots and Crawlers

Closed as a duplicate of http://stackoverflow.com/questions/308772/what-are-some-ways-to-protect-emails-on-websites-from-spambots. I am finally putting up my personal web site. I want to publish a webmaster/feedback email address on every page, but I am concerned about spam crawlers extracting the email address and bombarding me. This is especia...

Crawler/parser for Xapian

I would like to implement a search engine that crawls a set of web sites, extracts specific information from the pages, and creates a full-text index of that specific information. It seems to me that Xapian could be a good choice for the search engine library. What are the options for a crawler/parser to integrate with Xapian? Would...

Web crawler links/page logic in PHP

I'm writing a basic crawler that simply caches pages with PHP. All it does is use file_get_contents to get the contents of a webpage and a regex to pull out all the links <a href="URL">DESCRIPTION</a> - at the moment it returns: Array { [url] => URL [desc] => DESCRIPTION } The problem I'm having is figuring out the logic behind determining wh...

Anyone know of a good Python based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers (http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers), but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it i...

What are the best prebuilt libraries for doing Web Crawling in Python

I need to crawl a finite list of websites and store their contents locally for future analysis. I basically want to slurp in all the pages and follow all internal links to get the entire publicly available site. Are there existing free libraries to get me there? I've seen Chilkat, but it's for pay. I'm just looking for baseline function...
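
If nothing off the shelf fits, the baseline behaviour described here is small enough to sketch with the standard library alone: breadth-first, same-host links only, pages kept in a dict. The start URL is a placeholder, and there is no politeness delay or error handling.

    import urllib.request
    from urllib.parse import urljoin, urlparse
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_site(start_url):
        host = urlparse(start_url).netloc
        seen, queue, pages = set(), [start_url], {}
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
            pages[url] = html                          # store content for later analysis
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href).split("#")[0]
                if urlparse(absolute).netloc == host:  # follow internal links only
                    queue.append(absolute)
        return pages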

Asp.net Request.Browser.Crawler - Dynamic Crawler List?

I learned Why Request.Browser.Crawler is Always False in C# (http://www.digcode.com/default.aspx?page=ed51cde3-d979-4daf-afae-fa6192562ea9&article=bc3a7a4f-f53e-4f88-8e9c-c9337f6c05a0). Does anyone use a method to dynamically update the crawler list, so Request.Browser.Crawler will be really useful? ...

robots.txt: disallow all but a select few, why not?

I've been thinking for a while about disallowing every crawler except Ask, Google, Microsoft, and Yahoo! from my site. The reasoning behind this is that I've never seen any traffic generated by any of the other web crawlers out there. My questions are: Is there any reason not to? Has anybody done this? Did you notice any negative e...
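
For what it's worth, the whitelist itself is easy to express; the caveat is that robots.txt is purely advisory, so it only affects crawlers that choose to obey it. The agent names below are the ones those four engines publish (Slurp is Yahoo!, Teoma is Ask):

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: Teoma
    Disallow:

    User-agent: *
    Disallow: /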

How to best develop web crawlers

Hey all, I often create crawlers to compile information, and when I come across a website that has the info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. The way I do it is with a simple for loop to iterate over the page list, a wget to download each page, and sed, tr, awk, or other utilities to cle...
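
A compact version of the workflow described above, with hypothetical file names (pages.txt listing one URL per line, clean.sed holding the extraction rules); the sleep keeps it polite between fetches.

    while read -r url; do
        wget -q -O - "$url" | sed -f clean.sed >> results.txt
        sleep 1
    done < pages.txt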