I just started thinking about creating/customizing a web crawler today, and I know very little about web crawler/robot etiquette. Most of the writing on etiquette I've found seems dated and awkward, so I'd like to get some current (and practical) insights from the web developer community.
I want to use a crawler to walk over "the web...
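As a rough illustration, the etiquette advice usually boils down to: respect robots.txt, identify yourself with a contact URL, and throttle your requests. A minimal Python sketch of that loop, with a placeholder user-agent, site, and delay:

    import time
    import urllib.request
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "MyCrawler/0.1 (+http://example.com/crawler-info)"  # placeholder identity

    rp = RobotFileParser("http://example.com/robots.txt")
    rp.read()

    def polite_fetch(url, default_delay=2.0):
        """Fetch url only if robots.txt allows it, then pause before the next request."""
        if not rp.can_fetch(USER_AGENT, url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(rp.crawl_delay(USER_AGENT) or default_delay)  # honor Crawl-delay if set
        return body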
Are there any tools that will spider an ASP.NET website and create a static site?
...
I would like to get data from different webpages, such as addresses of restaurants or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?
...
Hi All,
I have been thinking about trying to write a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it fin...
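The mechanics being asked about (where to point it, how to report findings while continuing) come down to a frontier queue plus a visited set. A bare-bones breadth-first sketch in Python, using the requests library and a naive regex for links (in practice you would also restrict it to your own domains):

    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests  # pip install requests

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl: start from seeds, record findings, keep going."""
        frontier = deque(seed_urls)          # where the crawler is "pointed"
        seen = set(seed_urls)
        findings = []                        # what gets "sent back"
        while frontier and len(findings) < max_pages:
            url = frontier.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            findings.append(url)             # or titles, word counts, etc.
            for link in re.findall(r'href=["\'](.*?)["\']', html):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return findings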
I need a tool to make screenshots of every page on a rather large site, so I'm looking for a tool that can (best-case scenario) automatically spider the site and save a screenshot of every page to a folder, or (plan B) a browser plug-in that automatically takes a screenshot of every page I load/visit and saves it to my drive.
...
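One way to script the "best case" variant, assuming the list of page URLs is already available (from a sitemap or a crawl) and that Selenium with Firefox is installed, is roughly:

    import os
    from selenium import webdriver  # pip install selenium

    def screenshot_pages(urls, out_dir="screenshots"):
        """Load each URL in a real browser and save a PNG of the rendered page."""
        os.makedirs(out_dir, exist_ok=True)
        driver = webdriver.Firefox()
        driver.set_window_size(1280, 1024)
        try:
            for i, url in enumerate(urls):
                driver.get(url)
                driver.save_screenshot(os.path.join(out_dir, f"page_{i:04d}.png"))
        finally:
            driver.quit()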
We've got Ultraseek 5.7 indexing the content on our corporate intranet site, and we'd like to make sure our web pages are being optimized for it.
Which SEO techniques are useful for Ultraseek, and where can I find documentation about these features?
Features I've considered implementing:
Make the title and first H1 contain the most...
What options are there to detect web-crawlers that do not want to be detected?
(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)
I'm not talking about t...
Ignoring the IE case, are there any other browsers that can't understand the application/xhtml+xml content type? And what about the search engine spiders?
I could not find any answers on the web that weren't a few years old and thus possibly inaccurate.
Edit:
Somewhat related question: http://stackoverflow.com/questions/278746/what...
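The usual hedge against clients (and spiders) that choke on application/xhtml+xml is server-side negotiation on the Accept header, falling back to text/html. A minimal sketch of just that decision, framework left open:

    def negotiate_content_type(accept_header):
        """Serve application/xhtml+xml only to clients that claim to accept it."""
        if "application/xhtml+xml" in (accept_header or ""):
            return "application/xhtml+xml"
        return "text/html"

    # e.g. negotiate_content_type(request.headers.get("Accept")) in your framework of choice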
What is a good crawler (spider) to use against HTML and XML documents (local or web-based) and that works well in the Lucene / Solr solution space? Could be Java-based but does not have to be.
...
I essentially want to spider my local site and create a list of all the titles and URLs as in:
http://localhost/mySite/Default.aspx My Home Page
http://localhost/mySite/Preferences.aspx My Preferences
http://localhost/mySite/Messages.aspx Messages
I'm running Windows. I'm open to anything that works--a C# console app, Powe...
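The asker mentions C# or PowerShell; purely as an illustration of the idea (fetch each page, pull out the <title>, print "URL Title"), here is a Python sketch assuming the list of URLs is already known:

    import re
    import urllib.request

    pages = [
        "http://localhost/mySite/Default.aspx",
        "http://localhost/mySite/Preferences.aspx",
        "http://localhost/mySite/Messages.aspx",
    ]

    for url in pages:
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else "(no title)"
        print(f"{url} {title}")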
Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders because the path is allowed in robots.txt, but a 'nofollow' clause in the meta tag prevents spiders from going beyond the first page.
<meta name="robots" content="in...
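For reference (the asker's exact tag is cut off above), the two variants in play generally look like this:

    <!-- current: page indexed, but links not followed -->
    <meta name="robots" content="index, nofollow">
    <!-- proposed: indexed and links followed (the default when the tag is omitted) -->
    <meta name="robots" content="index, follow">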
I want to find (not generate) 2 text strings such that, after removing all non-letters and uppercasing, one string can be translated to the other by simple substitution.
The motivation for this comes from a project I know of that is testing methods for attacking ciphers via probability distributions. I'd like to find a large, coherent plai...
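The "translatable by simple substitution" test itself is cheap: strip non-letters, uppercase, and compare the first-occurrence patterns of the two strings (equal patterns mean a consistent, invertible letter mapping exists). A sketch:

    import re

    def pattern(text):
        # Keep letters only, uppercase, then replace each distinct letter with the
        # index of its first occurrence: "HELLO" -> (0, 1, 2, 2, 3)
        letters = re.sub(r"[^A-Za-z]", "", text).upper()
        first_seen = {}
        return tuple(first_seen.setdefault(c, len(first_seen)) for c in letters)

    def substitution_equivalent(a, b):
        return pattern(a) == pattern(b)

    print(substitution_equivalent("abc! def", "XYZ QRS"))  # True
    print(substitution_equivalent("aba", "xyz"))           # False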
A friend accidentally deleted his forum database. Which wouldn't normally be a huge issue, except for the fact that he neglected to perform backups. 2 years of content is just plain gone. Obviously, he's learned his lesson.
The good news, however, is that Google keeps backups, even if individual site owners are idiots. The bad news is, ...
I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open-source crawlers (http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers) but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it i...
I have done some research on spidering and think it is a little too complex for the fairly simple app I am trying to make. Some data on a web page is not visible in the page source because it only appears once the browser renders it.
If I wanted to get a value from a specific web page that I was to display in a WebBrowser control, is there any...
I run a small webserver, and lately it's been getting creamed by a search engine spider. What's the proper way to cool it down? Should I send it 5xx responses periodically? Is there a robots.txt setting I should be using? Or something else?
...
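Rather than sending 5xx responses, the usual first step is a Crawl-delay directive in robots.txt; it is non-standard, and Google ignores it (Google's crawl rate is set in its webmaster tools instead), but the Yahoo and MSN spiders of that era honored it. For example:

    User-agent: *
    # ask compliant crawlers to wait 10 seconds between requests
    Crawl-delay: 10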
OK, here's the deal in brief: I spider the web (all kinds of data: blogs/news/forums) as it appears on the internet. Then I process this feed and do analysis on the processed data. Spidering is not a big deal; I can get it pretty much in real time as the internet gets new data. Processing is the bottleneck; it involves some computationally heavy algor...
I am doing some analysis by mining web content using my crawlers. Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from the actual content.
Extracting the meaningful content is a difficult problem, as I understand it, considering the fact that there is no...
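There is no single right answer, but one cheap heuristic is to score candidate containers by how much paragraph text they directly hold and keep the densest one. A rough BeautifulSoup sketch of that idea (the tag lists and scoring are guesses, not a published algorithm):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def extract_main_text(html):
        """Very rough 'article body' extraction: keep the paragraph-densest container."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
            tag.decompose()                      # obvious non-content
        candidates = {}                          # id(parent) -> [parent, total text length]
        for p in soup.find_all("p"):
            entry = candidates.setdefault(id(p.parent), [p.parent, 0])
            entry[1] += len(p.get_text(strip=True))
        if not candidates:
            return soup.get_text(" ", strip=True)
        best = max(candidates.values(), key=lambda e: e[1])[0]
        return "\n".join(p.get_text(" ", strip=True) for p in best.find_all("p"))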
Hey guys/girls,
Basically I need to get around max execution time.
I need to scrape pages for info at varying intervals, which means calling the bot at those intervals to load a link from the database and scrape the page the link points to.
The problem is loading the bot. If I load it with JavaScript (like an Ajax call), the browser ...
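Assuming this is PHP's max_execution_time and the bot is currently kicked off by a web request, the usual escape hatch is to run it from the command line on a schedule instead (the PHP CLI has no execution time limit by default). A cron entry along these lines, with hypothetical paths:

    # run the scraper every 15 minutes, outside the web server and its time limit
    */15 * * * * php /path/to/bot.php >> /var/log/bot.log 2>&1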
Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?
...
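robots.txt is purely advisory (it only keeps out crawlers that choose to obey it), but the usual pattern for this, using the then-current user-agent tokens for the Google, Yahoo and MSN spiders, is an empty Disallow for the allowed bots and a blanket Disallow for everyone else:

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: *
    Disallow: /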