spider

What are the key considerations when creating a web crawler?

I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I've found seem old and awkward, so I'd like to get some current (and practical) insights from the web developer community. I want to use a crawler to walk over "the web...
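The core of crawler etiquette hasn't changed much: check robots.txt before fetching and rate-limit your requests. A minimal sketch using Python's standard `urllib.robotparser` — the robots.txt body and user-agent string here are made up for illustration:

```python
# Honor robots.txt before fetching anything. In a real crawler you would
# also sleep between requests to the same host.
import urllib.robotparser

# Hypothetical robots.txt content for demonstration:
ROBOTS = """
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

def may_fetch(url, agent="MyCrawler"):
    """Return True if robots.txt permits this agent to fetch the URL."""
    return rp.can_fetch(agent, url)

print(may_fetch("http://example.com/public/page.html"))   # True
print(may_fetch("http://example.com/private/page.html"))  # False
```

In production you would load the real file with `rp.set_url(".../robots.txt"); rp.read()` once per host and cache the result.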

Tools to convert asp.net dynamic site into static site

Are there any tools that will spider an asp.net website and create a static site? ...
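Not a purpose-built spider, but one commonly suggested approach is to let wget mirror the rendered output of the dynamic site; the URL below is a placeholder:

```shell
# Mirror the site as served, rewriting links for offline use and saving
# .aspx responses with an .html extension so they open as static files.
wget --mirror --convert-links --adjust-extension --page-requisites \
     http://www.example.com/
```

This only captures pages reachable by following links, so anything behind forms or POST-only navigation will be missed.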

best library to do web-scraping

I would like to get data from different webpages, such as restaurant addresses or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites? ...
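Whichever library is chosen, the underlying job is the same: parse the HTML and pull out the elements of interest. As a baseline, Python's standard `html.parser` can already do simple extractions; the markup below is a made-up example:

```python
# Minimal stdlib baseline: collect the visible text of every <li>, e.g.
# a list of addresses. Real sites usually warrant a dedicated library,
# but this shows the core extraction loop.
from html.parser import HTMLParser

class ItemExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li and data.strip():
            self.items.append(data.strip())

page = "<ul><li>123 Main St</li><li>456 Oak Ave</li></ul>"
p = ItemExtractor()
p.feed(page)
print(p.items)  # ['123 Main St', '456 Oak Ave']
```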

How to write a crawler?

Hi All, I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it fin...
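In outline: seed a queue with your starting URLs, pop a page, record your findings, extract its links, and push unseen ones back onto the queue. A minimal breadth-first sketch — `fetch()` is injected so the example runs against a fake two-page site instead of the network:

```python
# Breadth-first crawler skeleton: a queue answers "how does it keep
# crawling", a seen-set prevents revisiting, and the `found` list is the
# stand-in for "sending back its findings".
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, limit=100):
    seen, queue, found = {seed}, deque([seed]), []
    while queue and len(found) < limit:
        url = queue.popleft()
        html = fetch(url)          # record/store findings here
        found.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return found

# Fake two-page site (hypothetical domain) for demonstration:
site = {
    "http://npo.example/": '<a href="/about.html">About</a>',
    "http://npo.example/about.html": '<a href="/">Home</a>',
}
print(crawl("http://npo.example/", lambda u: site.get(u, "")))
```

A real `fetch` would do an HTTP GET, check robots.txt first, and restrict `urljoin` results to the domains you intend to crawl.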

Automated screenshots?

I need a tool to make screenshots of every page on a rather large site so I'm looking for a tool that can (best case scenario) automatically spider the site and create screenshots of every page in a folder or (plan B) a browser plug-in that automatically takes a screenshot of every page I load/visit and saves it to my drive. ...

SEO for Ultraseek 5.7

We've got Ultraseek 5.7 indexing the content on our corporate intranet site, and we'd like to make sure our web pages are being optimized for it. Which SEO techniques are useful for Ultraseek, and where can I find documentation about these features? Features I've considered implementing: Make the title and first H1 contain the most...

Detecting 'stealth' web-crawlers

What options are there to detect web-crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.) I'm not talking about t...
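For the mistake-prone crawlers the question is willing to settle for, simple behavioral thresholds catch a lot. One illustrative signal is request rate per IP inside a sliding window — the numbers below are made up, and real setups combine many signals (honeypot links, missing CSS/image fetches, header anomalies):

```python
# Flag clients whose request rate exceeds a human-plausible threshold
# inside a sliding time window. Thresholds are illustrative only.
import time
from collections import defaultdict, deque

WINDOW = 10.0    # seconds of history to keep
MAX_HITS = 20    # requests per window before we suspect a bot

hits = defaultdict(deque)

def looks_like_bot(ip, now=None):
    """Record one request from `ip` and report whether it exceeds the rate."""
    now = time.time() if now is None else now
    q = hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:   # drop requests outside the window
        q.popleft()
    return len(q) > MAX_HITS

# 25 requests in one second from the same IP trips the threshold:
print(any(looks_like_bot("10.0.0.9", now=i * 0.04) for i in range(25)))  # True
```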

What is the current level of XHTML support in browsers and search engine spiders?

Ignoring the IE case, are there any other browsers that can't understand the application/xhtml+xml content type? And what about the search engine spiders? I could not find any answers on the web that would not be a few years old and thus possibly inaccurate. Edit: Somehow related question: http://stackoverflow.com/questions/278746/what...

Recommendations for a spidering tool to use with Lucene or Solr?

What is a good crawler (spider) to use against HTML and XML documents (local or web-based) and that works well in the Lucene / Solr solution space? Could be Java-based but does not have to be. ...

Quickest way to get list of <title> values from all pages on localhost website

I essentially want to spider my local site and create a list of all the titles and URLs as in: http://localhost/mySite/Default.aspx My Home Page http://localhost/mySite/Preferences.aspx My Preferences http://localhost/mySite/Messages.aspx Messages I'm running Windows. I'm open to anything that works--a C# console app, Powe...
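A small script is enough for this. The sketch below extracts the `<title>` with the standard-library parser; the fetching loop is commented out and the URL list is hypothetical so the snippet stands alone:

```python
# Extract the <title> text from an HTML page.
from html.parser import HTMLParser
# from urllib.request import urlopen

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def get_title(html):
    p = TitleParser()
    p.feed(html)
    return p.title.strip()

# With a list of page URLs (or links found by a crawler), print "url<TAB>title":
# for url in ["http://localhost/mySite/Default.aspx"]:
#     print(url, get_title(urlopen(url).read().decode()), sep="\t")
print(get_title("<html><head><title>My Home Page</title></head></html>"))
```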

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page. <meta name="robots" content="in...
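One pattern often used for this (the exact policy is up to the SEO team) is to let spiders *follow* paging links but only *index* the first page, so link equity flows through the results without the index filling with near-duplicate pages:

```html
<!-- On page 1 of the search results: -->
<meta name="robots" content="index,follow">
<!-- On page 2 and beyond: -->
<meta name="robots" content="noindex,follow">
```

Capping the number of paging links emitted in the markup (rather than relying on the spider to stop) also bounds the crawl directly.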

How to find "equivalent" texts?

I want to find (not generate) 2 text strings such that, after removing all non-letters and converting to upper case, one string can be translated to the other by simple substitution. The motivation for this comes from a project I know of that is testing methods for attacking ciphers via probability distributions. I'd like to find a large, coherent plai...
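A standard trick applies here: two strings are related by a simple substitution exactly when they share the same *letter pattern* — replace each distinct letter by the index of its first occurrence, after stripping non-letters and upper-casing. A sketch:

```python
# Normalize a string to its substitution-invariant pattern. Two strings
# are equivalent under a one-to-one letter substitution iff their
# patterns are equal.
def pattern(text):
    letters = [c.upper() for c in text if c.isalpha()]
    first_seen = {}
    out = []
    for c in letters:
        if c not in first_seen:
            first_seen[c] = len(first_seen)
        out.append(first_seen[c])
    return tuple(out)

def equivalent(a, b):
    return pattern(a) == pattern(b)

print(equivalent("Hello!", "muyya"))  # True: both map to (0, 1, 2, 2, 3)
print(equivalent("Hello!", "world"))  # False: 'world' has no repeated letter
```

To *find* such pairs in a large corpus, compute the pattern of every candidate string once and bucket strings by pattern; any bucket with two or more entries is a hit.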

How would someone download a website from Google Cache?

A friend accidentally deleted his forum database. Which wouldn't normally be a huge issue, except for the fact that he neglected to perform backups. 2 years of content is just plain gone. Obviously, he's learned his lesson. The good news, however, is that Google keeps backups, even if individual site owners are idiots. The bad news is, ...

Anyone know of a good Python based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open-source crawlers (http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers) but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it i...

Creating a simple 'spider'

I have researched spidering and think that it is a little too complex for the quite simple app I am trying to make. Some data on a web page is not visible in the source because it is only rendered by the browser. If I wanted to get a value from a specific web page that I was to display in a WebBrowser control, is there any...

What do I do if a search engine spider is hammering my site?

I run a small webserver, and lately it's been getting creamed by a search engine spider. What's the proper way to cool it down? Should I send it 5xx responses periodically? Is there a robots.txt setting I should be using? Or something else? ...
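Returning 503 with a `Retry-After` header is the conventional "back off" signal for a well-behaved bot, but serving 5xx continually can hurt how engines treat the site. A gentler first step is a `Crawl-delay` directive, which some engines (e.g. Yahoo! and MSN) honor; Google ignores it but exposes a crawl-rate setting in its Webmaster Tools:

```
User-agent: *
Crawl-delay: 10
```

If one specific spider is the problem, name its user-agent in its own record instead of `*` so the rule applies only to it.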

Processing web feed multiple times a day

Ok, here is in brief the deal: I spider the web (all kinds of data: blogs/news/forums) as it appears on the internet. Then I process this feed and do analysis on the processed data. Spidering is not a big deal; I can get it pretty much in real time as the internet gets new data. Processing is a bottleneck; it involves some computationally heavy algor...

Extracting meaningful content from web pages

I am doing some analysis by mining web content using my crawlers. Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extracting the meaningful content is a difficult problem, as I understand it, considering the fact that there is no...
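One family of heuristics scores each block of the page by text density versus link density, on the theory that navigation and ad regions are short and link-heavy while article bodies are long runs of plain text. A deliberately crude sketch — the split pattern and penalty weight are arbitrary illustrations, and a real implementation would walk a parsed DOM:

```python
# Keep the chunk of markup with the best text-to-link score.
import re

def densest_block(html):
    # Split on <div>/<p> boundaries (crude; a DOM walk is more robust).
    blocks = re.split(r"</?(?:div|p)[^>]*>", html, flags=re.I)

    def score(block):
        text = re.sub(r"<[^>]+>", "", block)          # strip remaining tags
        links = len(re.findall(r"<a\b", block, flags=re.I))
        return len(text.strip()) - 40 * links          # penalize link-heavy chunks

    return max(blocks, key=score).strip()

page = ("<div><a href='/a'>Ads</a><a href='/b'>More ads</a></div>"
        "<p>The actual article text, long enough to win on density.</p>"
        "<div><a href='/n'>Nav</a></div>")
print(densest_block(page))
```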

How to execute a PHP spider/scraper but without it timing out

Hey guys/girls, Basically I need to get around max execution time. I need to scrape pages for info at varying intervals, which means calling the bot at those intervals to load a link from the database and scrape the page the link points to. The problem is loading the bot. If I load it with JavaScript (like an Ajax call) the browser ...
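The usual way around `max_execution_time` is to not run the scraper through a web request at all: invoke it from the command line (where PHP's `max_execution_time` defaults to 0, i.e. unlimited) on a schedule. A hypothetical crontab entry, with the script path made up:

```
# Run the scraper every 15 minutes via the PHP CLI, outside any web request.
*/15 * * * * /usr/bin/php /path/to/scraper.php
```

This also removes the browser and Ajax call from the picture entirely; the web app only needs to read whatever the scraper wrote to the database.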

Robots.txt: allow only major SE

Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders? ...
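Yes: allow those engines' user-agents explicitly and disallow everyone else. Note that robots.txt is purely advisory, so only well-behaved crawlers will obey it. The agent names below match those engines' crawlers (Googlebot, Yahoo!'s Slurp, and msnbot):

```
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
```

An empty `Disallow:` line means "nothing is disallowed", i.e. full access for that agent.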