Hi All,
I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.
Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it fin...
I once wrote a crawler in .NET. To improve its scalability, I tried to take advantage of .NET's asynchronous API.
System.Net.HttpWebRequest has the asynchronous API BeginGetResponse/EndGetResponse. However, this pair of methods only gets the HTTP response headers and a Stream instance from which we can extract HTTP response co...
I am writing a crawler in Python. So that pressing Ctrl+C does not make the crawler start over on the next run, I need to save the processing deque in a text file (one item per line) and update it on every iteration; the update operation needs to be super fast. To avoid reinventing the wheel, I am asking if there is an established module t...
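One possible way to persist the deque described above, as a minimal sketch using only the standard library. The names `save_queue` and `load_queue` are hypothetical, and rewriting the whole file on every iteration is an assumption that only holds up for small queues; it is not an established module.

```python
import os
import tempfile
from collections import deque


def save_queue(queue, path):
    """Write one item per line to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for item in queue:
            handle.write(item + "\n")
    # The rename is atomic on POSIX, so an interrupted run never
    # leaves a half-written queue file behind.
    os.replace(tmp_path, path)


def load_queue(path):
    """Rebuild the deque from the file, or start empty if none exists."""
    if not os.path.exists(path):
        return deque()
    with open(path, encoding="utf-8") as handle:
        return deque(line.rstrip("\n") for line in handle if line.strip())
```

On startup the crawler would call `load_queue` and, after each processed item, `save_queue`. For very large queues an append-only log or an embedded database would likely be faster than a full rewrite.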
Hi,
How can I use WatiN to get the list of available buttons on a website?
How does the watinTestRecorder do it?
thanks
...
I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers (http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers), but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it i...
I learned why Request.Browser.Crawler is always false in C# (http://www.digcode.com/default.aspx?page=ed51cde3-d979-4daf-afae-fa6192562ea9&article=bc3a7a4f-f53e-4f88-8e9c-c9337f6c05a0).
Does anyone use a method to dynamically update the crawler list, so that Request.Browser.Crawler is actually useful?
...
I want to crawl a site with Greasemonkey and wonder if there is a better way to temporarily store values than with GM_setValue.
What I want to do is crawl my contacts in a social network and extract the Twitter URLs from their profile pages.
My current plan is to open each profile in its own tab, so that it looks more like a normal br...
Hey all,
I often create crawlers to compile information. When I come across a website whose information I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.
The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk or other utilities to cle...
I was wondering: is a website's search engine rank or results position affected if its size is not optimized and it has merely average loading times compared with websites of the same type? Let's say:
No Cache:
289.0K Total size
35 HTTP requests
...
Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?
...
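A rough sketch of what such a robots.txt might look like, checked with Python's standard `urllib.robotparser`. The user-agent tokens `Googlebot`, `Slurp` (Yahoo!) and `msnbot` are assumptions about how those spiders identify themselves and should be verified against each engine's documentation. Note that robots.txt is only advisory: it stops well-behaved crawlers, not ones that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens; an empty Disallow means "nothing is disallowed".
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url="http://example.org/page.html"):
    """Return True if the rules above let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `is_allowed("Googlebot")` is permitted by the first record, while any agent that matches none of the named records falls through to the catch-all `*` record and is refused.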
My application needs to keep track of RSS/Atom feeds and save the new entries in a database. My question is: what is the most reliable method to determine whether an entry in a feed has already been crawled or not? I use the Universal Feed Parser module to parse the feeds. My current implementation keeps a record of the latest value of feed.en...
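One commonly suggested alternative to tracking a single "latest" value is to remember a stable key per entry. The following is a sketch under assumptions: entries are treated as dict-like objects with optional `id`, `link`, `title` and `summary` fields, and `entry_key` / `new_entries` are hypothetical helper names. In a real application the set of seen keys would live in the database rather than in memory.

```python
import hashlib


def entry_key(entry):
    """Prefer the feed-supplied id, then the link, then a content hash."""
    for field in ("id", "link"):
        value = entry.get(field)
        if value:
            return value
    # Fallback for feeds that provide neither an id nor a link.
    content = (entry.get("title", "") + entry.get("summary", "")).encode("utf-8")
    return hashlib.sha1(content).hexdigest()


def new_entries(entries, seen_keys):
    """Return entries whose key is not in seen_keys, and record their keys."""
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(entry)
    return fresh
```

The design choice here is that a per-entry key survives feeds that reorder items or omit dates, at the cost of storing one key per entry.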
I'm working on a web crawler in VB.NET, using the System.Windows.Forms.WebBrowser object to handle navigation on sites that use JavaScript or form posts, but I'm having a problem. When I navigate backwards (WebBrowser.GoBack()) to a page that was loaded with a form post, the page has expired and I have to do a refresh to resend the reques...
Is there a standard length of time that a crawler must wait between repeated hits to the same server, so as not to overburden it?
If not, any suggestions on what would be a good waiting period for the crawler to be considered polite?
Does this value also vary from server to server, and if so, how can one determine it?
An...
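One per-server signal is the non-standard `Crawl-delay` directive that some sites put in robots.txt. A minimal sketch of reading it with Python's `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later); `polite_delay` is a hypothetical helper and the one-second fallback is an arbitrary assumption, not a standard value.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 1.0  # assumed fallback, not a standard


def polite_delay(robots_lines, user_agent, default=DEFAULT_DELAY_SECONDS):
    """Return the site's Crawl-delay for this agent, or the fallback."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A crawler could call this once per host and sleep for the returned number of seconds between requests to that host.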
Is there a good web crawler library available for PHP or Ruby? A library that can crawl depth first or breadth first, and handle links even when href="../relative_path.html" and a base URL are used.
...
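This is not a PHP or Ruby library, but the link-resolution part of the question can be illustrated in Python with the standard library alone. The sketch below assumes simple HTML where links are `<a href>` tags and an optional `<base href>` appears before them; `LinkExtractor` is a hypothetical class name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href>, honouring <base href>."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url  # used until a <base> tag overrides it
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        if tag == "base":
            self.base_url = urljoin(self.base_url, href)
        elif tag == "a":
            # urljoin resolves "../" segments against the current base.
            self.links.append(urljoin(self.base_url, href))
```

Feeding the page source to `LinkExtractor(page_url).feed(html)` leaves the resolved links in `.links`, ready to be pushed onto a depth-first stack or a breadth-first queue.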
As I remember, and as I have checked, the usual way to traverse a tree or crawl the web breadth first (BFS) is to use a queue. Is there actually a way to implement it without using a queue?
...
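One way to get breadth-first order without an explicit queue is to process the graph level by level with two plain lists, the current frontier and the next one. A minimal sketch, assuming the graph is given as a dict mapping each node to a list of neighbours (`bfs_levels` is a hypothetical name). The pair of lists still behaves like a queue in effect, so this avoids the data structure rather than the idea.

```python
def bfs_levels(graph, start):
    """Visit nodes in breadth-first order using two lists per level."""
    seen = {start}
    order = []
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            order.append(node)
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        # Everything one step further out becomes the new frontier.
        frontier = next_frontier
    return order
```

Iterative deepening (repeated depth-limited depth-first searches with an increasing limit) is another queue-free option that reaches nodes in order of depth, at the cost of revisiting shallow nodes.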
AFAIK,
$_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".
But is that the most reliable method?
Is there any other way?
...
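A frequently recommended check is a reverse DNS lookup followed by a forward lookup that must map back to the same IP, since a reverse record alone can be forged. The sketch below is in Python rather than PHP; `is_verified_crawler` and its injectable lookup parameters are hypothetical, and the allowed domain list is an assumption that should be taken from each search engine's own documentation.

```python
import socket


def is_verified_crawler(ip, allowed_domains, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check its domain, then forward-confirm it."""
    # Real DNS lookups by default; the parameters let tests inject fakes.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        # Require a dot before the domain so "evilgoogle.com" does not match.
        if not any(host.endswith("." + domain) for domain in allowed_domains):
            return False
        # The hostname must resolve back to the original address.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A call such as `is_verified_crawler(remote_ip, ("google.com", "yahoo.com"))` would perform two DNS lookups per request, so caching the result per IP is advisable.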
Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever).
How would I do that?
Have a spider/crawler that crawls the web to find the information I need (how would I tell the crawler what to crawl, since I don't want to fetch the whole web)?
Then have an ind...
I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (if not on the list with the Google crawler, etc), and send bogus information to the ...
Which is better for building a URL queue in a large-scale web crawler: a linked list or a B-tree?
...
I read some articles on web crawling and learnt the basics. According to them, web crawlers just follow the URLs retrieved from other web pages, traversing a tree (in practice, a mesh).
In this case, how does a crawler ensure maximum coverage? Obviously there may be a lot of sites that don't have referral links f...