crawler

How to write a crawler?

Hi All, I have been thinking about writing a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it fin...
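
A minimal sketch of the usual shape, in Python with only the standard library: a seed URL goes into a queue, each fetched page is recorded as a finding right away, and its links feed back into the queue so the crawl keeps going. The seed URL and the same-domain restriction are assumptions for illustration.

    # Minimal breadth-first crawler sketch (standard library only).
    # The seed URL is an illustrative placeholder.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=100):
        frontier = deque([seed])     # URLs waiting to be fetched
        seen = {seed}                # avoid revisiting pages
        findings = []                # results reported while crawling continues
        domain = urlparse(seed).netloc
        while frontier and len(findings) < max_pages:
            url = frontier.popleft()
            try:
                with urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue             # skip pages that fail to load
            findings.append(url)     # record the finding immediately
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return findings

    if __name__ == "__main__":
        for page in crawl("http://www.example.org/"):
            print(page)

You point it at whatever page you want as the root (the NPO's home page, say), and "sending back findings while still crawling" is just the findings list growing as the frontier is worked through.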

.NET Does NOT Have Reliable Asynchronous Socket Communication?

I once wrote a crawler in .NET. In order to improve its scalability, I tried to take advantage of the asynchronous API of .NET. System.Net.HttpWebRequest has the asynchronous API pair BeginGetResponse/EndGetResponse. However, this pair of APIs only gets the HTTP response headers and a Stream instance from which we can extract the HTTP response co...

Save a deque in a text file

I am writing a crawler in Python. To keep Ctrl+C from making my crawler start over on the next run, I need to save the processing deque to a text file (one item per line) and update it every iteration; the update operation needs to be super fast. So as not to reinvent the wheel, I am asking if there is an established module t...
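
One way to get crash-safe updates without a dedicated module is to rewrite the file through a temporary file and an atomic rename each iteration, so Ctrl+C can never leave it half-written; the file name below is an illustrative placeholder. For very large queues, an append-only journal or the standard shelve/sqlite3 modules would scale better.

    # Sketch: persist a deque to a text file (one item per line) after each
    # iteration. Writing to a temp file and renaming keeps the file intact
    # even if the process is interrupted mid-write.
    import os
    from collections import deque

    QUEUE_FILE = "queue.txt"   # illustrative placeholder

    def save_queue(dq):
        tmp = QUEUE_FILE + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write("\n".join(dq))
        os.replace(tmp, QUEUE_FILE)   # atomic rename

    def load_queue():
        if not os.path.exists(QUEUE_FILE):
            return deque()
        with open(QUEUE_FILE, encoding="utf-8") as f:
            return deque(line.strip() for line in f if line.strip())

    queue = load_queue() or deque(["http://www.example.org/"])
    while queue:
        url = queue.popleft()
        # ... fetch url, append newly discovered URLs to queue ...
        save_queue(queue)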

WatiN help: how do I get the list of buttons using WatiN?

Hi, how can I use WatiN to get the list of available buttons on a website? How does the watinTestRecorder do it? Thanks ...

Anyone know of a good Python based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers (http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers), but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it i...
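
If a library route works out, one option in pure Python is Scrapy; a minimal spider looks roughly like this (the start URL and CSS selectors are placeholders), run with "scrapy runspider thisfile.py":

    # Rough sketch of a minimal Scrapy spider.
    import scrapy

    class NicheSpider(scrapy.Spider):
        name = "niche"
        start_urls = ["http://www.example.org/"]   # placeholder

        def parse(self, response):
            # Extract whatever data matters from the current page...
            yield {"url": response.url, "title": response.css("title::text").get()}
            # ...and follow links so the crawl continues.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)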

ASP.NET Request.Browser.Crawler - Dynamic Crawler List?

I read "Why Request.Browser.Crawler is Always False in C#" (http://www.digcode.com/default.aspx?page=ed51cde3-d979-4daf-afae-fa6192562ea9&article=bc3a7a4f-f53e-4f88-8e9c-c9337f6c05a0). Does anyone use some method to dynamically update the crawler list, so that Request.Browser.Crawler will be really useful? ...

Best way to store data for a Greasemonkey-based crawler?

I want to crawl a site with Greasemonkey and wonder if there is a better way to temporarily store values than with GM_setValue. What I want to do is crawl my contacts in a social network and extract the Twitter URLs from their profile pages. My current plan is to open each profile in its own tab, so that it looks more like a normal br...

How to best develop web crawlers

Hey all, I am used to creating crawlers to compile information, and when I come across a website with info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. The way I do it is with a simple for loop to iterate over the page list, a wget to download each page, and sed, tr, awk or other utilities to cle...
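
One step up from a shell for loop with wget and sed, without jumping to a full framework, is separating fetching from extraction so only the site-specific part changes per crawler. A sketch in Python; the page-list URL pattern and the regular expression are hypothetical stand-ins for whatever the target site actually needs.

    # Sketch: the same fetch-then-clean loop as the shell approach, with the
    # site-specific extraction isolated in one place.
    import re
    from urllib.request import urlopen

    PAGE_URL = "http://www.example.org/list?page={n}"             # hypothetical page list
    FIELD_RE = re.compile(r'<span class="price">([^<]+)</span>')  # hypothetical field

    def fetch(url):
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def extract(html):
        return FIELD_RE.findall(html)

    for n in range(1, 11):   # iterate over the page list
        for value in extract(fetch(PAGE_URL.format(n=n))):
            print(value)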

crawl robots and page size

I was wondering: does a website get affected in terms of search engine rank or results positioning if its size is not optimized, but it has average loading times compared with the same type of websites? Let's say, with no cache: 289.0K total size, 35 HTTP requests. ...

Robots.txt: allow only major SE

Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders? ...
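
robots.txt can express this, with the caveat that it is purely advisory: only polite crawlers obey it. A sketch along these lines, using the user-agent tokens those engines documented (Googlebot, Slurp for Yahoo!, msnbot for MSN):

    # Allow only the major search-engine spiders; disallow everyone else.
    # Note that robots.txt is advisory, not access control.
    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow line means nothing is disallowed for that group, so the three named spiders get full access while the catch-all group blocks everything else.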

How to crawl a feed

My application needs to keep track of RSS/Atom feeds and save the new entries in a database. My question is: what is the most reliable method to determine whether an entry in a feed has already been crawled or not? I use the Universal Feed Parser module to parse the feeds. My current implementation keeps a record of the latest value of feed.en...
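
One approach that tends to be more robust than remembering only the latest value is keeping a set of per-entry identifiers: use the entry's id/guid when the feed provides one, otherwise fall back to a hash of link plus title. A sketch with feedparser, where the in-memory set stands in for whatever database table the application really uses:

    # Sketch: deduplicate feed entries by id, falling back to a hash of
    # link + title when the feed provides no id.
    import hashlib
    import feedparser

    seen = set()   # stands in for a database table of known entries

    def entry_key(entry):
        if entry.get("id"):
            return entry["id"]
        raw = (entry.get("link", "") + entry.get("title", "")).encode("utf-8")
        return hashlib.sha1(raw).hexdigest()

    def new_entries(url):
        feed = feedparser.parse(url)
        fresh = []
        for entry in feed.entries:
            key = entry_key(entry)
            if key not in seen:
                seen.add(key)
                fresh.append(entry)
        return fresh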

WebBrowser.Refresh problem in VB.Net

I'm working on a web crawler in VB.NET, using the System.Forms.WebBrowser object to handle navigation on sites that use JavaScript or form posts, but I'm having a problem. When I navigate backwards (WebBrowser.GoBack()) to a page that was loaded with a form post, the page has expired and I have to do a refresh to resend the reques...

What is the optimum duration for a web crawler to wait between repeated requests to a web server

Is there some standard time duration that a crawler should wait between repeated hits to the same server, so as not to overburden it? If not, any suggestions on what would be a good waiting period for the crawler to be considered polite? Does this value also vary from server to server, and if so, how can one determine it? An...
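
There is no single standard figure; common practice is to honor a Crawl-delay line in robots.txt when the site provides one and otherwise fall back to a fixed per-host delay of a few seconds. A sketch of per-host throttling in Python, where the 2-second fallback is an assumption rather than a specification:

    # Sketch: per-host politeness delay. Uses robots.txt Crawl-delay when
    # present, otherwise a fixed default.
    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    DEFAULT_DELAY = 2.0    # assumed fallback, not a standard
    last_hit = {}          # host -> time of the previous request
    robots = {}            # host -> parsed robots.txt

    def delay_for(host, agent="MyCrawler"):
        if host not in robots:
            rp = RobotFileParser("http://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                pass       # unreachable robots.txt -> just use the default
            robots[host] = rp
        return robots[host].crawl_delay(agent) or DEFAULT_DELAY

    def polite_wait(url):
        host = urlparse(url).netloc
        wait = delay_for(host) - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()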

is there a good web crawler library available for PHP or Ruby?

Is there a good web crawler library available for PHP or Ruby? A library that can crawl depth-first or breadth-first, and handle links even when href="../relative_path.html" and a base URL are used. ...
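
On the relative-href point, the resolution rule itself is standard regardless of language: join the link against the page's <base href> if one is declared, otherwise against the page URL. Illustrated here in Python terms (PHP and Ruby have equivalent URL-joining helpers); the URLs are placeholders:

    # Sketch of resolving a relative href against the base.
    from urllib.parse import urljoin

    page_url = "http://www.example.org/section/index.html"   # placeholder
    base_href = "http://www.example.org/other/"               # from a <base> tag, if any

    base = base_href or page_url
    print(urljoin(base, "../relative_path.html"))
    # -> http://www.example.org/relative_path.html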

is breadth first search or breadth first traversal possible without using a queue?

As I remember (and have checked), the usual way to traverse a tree or crawl the web breadth-first (BFS) is by using a queue. Is there actually a way to implement it without using a queue? ...
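
Yes: iterative deepening, i.e. repeated depth-limited DFS with an increasing depth limit, emits nodes level by level exactly as BFS would, using only the recursion stack instead of an explicit queue, at the cost of re-expanding shallow nodes on each pass. A sketch over a small hypothetical tree:

    # Sketch: breadth-first order without a queue, via iterative deepening.
    # Each pass emits only the nodes at the current depth limit.
    tree = {
        "A": ["B", "C"],
        "B": ["D", "E"],
        "C": ["F"],
        "D": [], "E": [], "F": [],
    }

    def depth_limited(node, depth, limit, visit):
        if depth == limit:
            visit(node)    # emit nodes exactly at the current level
            return
        for child in tree[node]:
            depth_limited(child, depth + 1, limit, visit)

    def iddfs(root, max_depth):
        order = []
        for limit in range(max_depth + 1):
            depth_limited(root, 0, limit, order.append)
        return order

    print(iddfs("A", 2))   # ['A', 'B', 'C', 'D', 'E', 'F'] -- BFS order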

How to identify the web crawlers of Google/Yahoo/MSN in PHP?

AFAIK, $_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com", but is that the most reliable method? Is there any other way? ...
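
A more robust check, and the one Google documents for verifying Googlebot, is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm the hostname maps back to that IP; user-agent strings and REMOTE_HOST alone are easy to spoof. Sketched in Python (PHP's gethostbyaddr/gethostbyname work the same way); the IP and the trusted host suffixes are assumptions to adapt:

    # Sketch: verify a claimed search-engine crawler by reverse DNS plus a
    # confirming forward lookup. IP and suffixes are illustrative.
    import socket

    TRUSTED_SUFFIXES = (".googlebot.com", ".google.com",
                        ".crawl.yahoo.net", ".search.msn.com")

    def is_search_engine(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]              # reverse lookup
        except socket.herror:
            return False
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]   # forward-confirm
        except socket.gaierror:
            return False

    print(is_search_engine("192.0.2.1"))   # documentation-range IP -> False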

How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that? Have a spider/crawler that will crawl the web to find the information I need (how would I tell the crawler what to crawl, since I don't want to get the whole web)? Then have an ind...

Identifying hostile web crawlers

I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site. Ideally, this system would detect a crawling pattern from an unknown source (one not on the whitelist with the Google crawler, etc.) and send bogus information to the ...

Building a URL queue

Which is better for building a URL queue in a large-scale web crawler: a linked list or a B-tree? ...
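
At sketch level the frontier is usually two structures rather than one: a FIFO for crawl order plus a membership structure for "have we seen this URL already"; a B-tree (or any disk-backed sorted store) mainly becomes relevant once the frontier no longer fits in memory. A small in-memory version in Python:

    # Sketch: URL frontier = FIFO queue for order + set for deduplication.
    from collections import deque

    class Frontier:
        def __init__(self, seeds):
            self.queue = deque()
            self.seen = set()
            for url in seeds:
                self.add(url)

        def add(self, url):
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)

        def next(self):
            return self.queue.popleft() if self.queue else None

    f = Frontier(["http://www.example.org/"])
    f.add("http://www.example.org/about")
    print(f.next(), f.next(), f.next())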

How does a crawler ensure maximum coverage?

I read some articles on web crawling and learned the basics. According to them, web crawlers just use the URLs retrieved from other web pages, going through a tree (practically a mesh). In this case, how does a crawler ensure maximum coverage? Obviously there may be a lot of sites that don't have referral links f...