What are the best practice and library I can use to type into a search textbox on an external website and collect the search results?
How do I tackle websites with different search boxes and checkboxes and collect the results?
Can Selenium be used to automate this?
Should I use Heritrix or Nutch? Which one is better? I heard Nutch comes with plugins. Whi...
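On the Selenium question: yes, it can automate this. A minimal Python sketch, assuming the target page has a search box named "q" and results in div.result elements (both selectors are assumptions you would adjust per site, which is also how you handle sites with different boxes and checkboxes):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    driver.get("https://example.com/search")          # placeholder URL
    box = driver.find_element(By.NAME, "q")           # assumption: box is named "q"
    box.send_keys("my query")
    box.send_keys(Keys.RETURN)                        # submit the search form
    for result in driver.find_elements(By.CSS_SELECTOR, "div.result"):
        print(result.text)                            # collect the results
    driver.quit()

Checkboxes are handled the same way: locate the element with a selector and call .click() on it.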
I've got a python web crawler and I want to distribute the download requests among many different proxy servers, probably running squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and eventually looping back around. Any idea how to set this up?...
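A minimal round-robin sketch with the requests library and itertools.cycle (the proxy hostnames are placeholders; 3128 is squid's default port):

    import itertools
    import requests

    PROXIES = itertools.cycle([
        "http://proxy1:3128",   # hypothetical squid instances
        "http://proxy2:3128",
        "http://proxy3:3128",
    ])

    def fetch(url):
        proxy = next(PROXIES)   # each call advances to the next proxy, then wraps around
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

    print(fetch("http://example.com").status_code)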
Is it possible to crawl billions of pages on a single server?
...
Hi all,
I am looking to develop a management and administration solution around our web-crawling Perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by SysAdmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can...
Our Situation:
Our team needs to retrieve log information from a 3rd-party website. Specifically, this log
information consists of call logs: our client rents an 866 number, and when calls come in, they assist
people and need to make notes accordingly in our application, tied to the
current call. Our client has a web account with ...
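Assuming the vendor site uses a plain HTML login form (no JavaScript requirement), one common approach is a cookie-keeping session; all URLs and field names below are hypothetical placeholders for the real form:

    import requests

    session = requests.Session()                      # keeps the login cookie
    session.post("https://vendor.example.com/login",  # placeholder login URL
                 data={"username": "user", "password": "secret"})  # placeholder field names
    logs = session.get("https://vendor.example.com/call-logs")     # authenticated request
    print(logs.text)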
Three months ago I published my small personal website (~10 pages), submitted the URL to Google, and a few days later Googlebot showed up. Over the last couple of weeks, Googlebot has visited my website approximately twice a week, crawling maybe every other page.
Ever since Googlebot first crawled my website, whenever I run a...
I'm looking for software (preferably a free .Net library) that will accept a url and determine its genre/purpose automatically. E.g. www.cnn.com = news, www.google.com = search engine. I would imagine something like that exists, and functions by either scraping the site and analyzing its content, or by simply comparing it to a master lis...
There's a way of excluding complete page(s) from Google's indexing. But is there a way to specifically exclude certain part(s) of a web page from Google's crawling? For example, exclude the side-bar, which usually contains unrelated content?
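There is no official robots directive for part of a page, but one commonly suggested workaround is to serve the side-bar from its own URL, embed it via an iframe, and block that URL in robots.txt so Googlebot never fetches its content (the paths below are placeholders):

    # robots.txt
    User-agent: *
    Disallow: /fragments/

    <!-- in the page -->
    <iframe src="/fragments/sidebar.html"></iframe>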
...
Hi all,
I'm a graduate student whose research area is complex networks. I am working on a project that involves analyzing connections between Facebook users. Is it possible to write a crawler for Facebook based on friendship information?
I looked around but couldn't find anything useful so far. It seems Facebook isn't fond of such activit...
In Episode 78 of the Joel & Jeff podcast one of the Doctype / Litmus guys states that you would never want to build a spider in ruby. Would anyone like to guess at his reasoning for this?
...
Is it possible to write a C# program that will load up a webpage, pass the webform parameters to login, then click on a link and download the page information? Obviously, I'd be supplying the username and password.
In context, let's say I want to check if there are any new news updates on my school account, which I must login to with m...
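The flow is the same in any language: POST the login form, keep the cookie, and "clicking a link" is just following its href. A sketch in Python for brevity (in C# the analogues would be HttpClient plus an HTML parser such as HtmlAgilityPack); all URLs and field names are placeholders:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    s = requests.Session()
    s.post("https://school.example.edu/login",        # placeholder login URL
           data={"user": "me", "pass": "secret"})     # placeholder field names
    home = s.get("https://school.example.edu/home")
    link = BeautifulSoup(home.text, "html.parser").find("a", string="News")
    news = s.get(urljoin(home.url, link["href"]))     # "click" = follow the href
    print(news.text)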
I was going to crawl a site for some research I was collecting. But, apparently, the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do?
Here is an example clause from the TOS. Also, what about sites that don't provide this particular clause?
Restriction...
Hi,
I'm making a little bot to crawl a few websites.
Now, I'm just testing it out and I tried two settings:
about 10 requests every 3 seconds - the IP got banned, so I said: OK, that's too fast.
2 requests every 3 seconds - the IP got banned after 30 minutes and 1000+ links crawled.
Is that still too fast? I ...
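There is no universal safe rate; it depends on the site. Beyond slowing down, the usual advice is to honor robots.txt, identify your bot, add jitter, and space requests per host. A sketch of that kind of throttle (the 10-second floor is an arbitrary assumption):

    import random
    import time
    from urllib.parse import urlparse
    import requests

    LAST_HIT = {}   # host -> timestamp of the previous request

    def polite_get(url, min_delay=10.0):
        host = urlparse(url).netloc
        # wait until at least min_delay (+ random jitter) has passed for this host
        wait = LAST_HIT.get(host, 0.0) + min_delay + random.uniform(0, 5) - time.time()
        if wait > 0:
            time.sleep(wait)
        LAST_HIT[host] = time.time()
        return requests.get(url, headers={
            "User-Agent": "MyLittleBot/0.1 (+http://example.com/bot)"})  # identify yourself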
I just had this thought, and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (e.g. Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps).
I've come across a paper where this was done, but I cannot recall the paper's title. It was about crawling the entire web on a single dedicated...
I am searching for a web crawler solution which is mature enough and can be simply extended. I am interested in the following features... or the possibility to extend the crawler to meet them (a rough sketch of the first two follows the list):
partly just to read the feeds of several sites
to scrape the content of these sites
if the site has an archive I would like to crawl and index it as we...
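For the first two items, a rough sketch with feedparser, requests, and BeautifulSoup (the feed URL and the per-site article selector are assumptions you would configure per source):

    import feedparser
    import requests
    from bs4 import BeautifulSoup

    feed = feedparser.parse("http://example.com/rss.xml")      # placeholder feed URL
    for entry in feed.entries:
        page = requests.get(entry.link)                        # fetch the linked article
        soup = BeautifulSoup(page.text, "html.parser")
        body = soup.find("div", class_="article")              # per-site selector
        if body:
            print(entry.title, "-", len(body.get_text()), "chars scraped")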
The page view counter on each MediaWiki page seems like a great way to identify popular pages which are worth putting more effort into keeping up-to-date and useful, but I've hit a problem.
We use a Google Search Appliance to index our MediaWiki installation. The problem I have is that the GSA increments the page view counter each time ...
I am looking to develop a web scraper (in C# Windows Forms). The whole idea I am trying to accomplish is as follows (a sketch of the locator step follows the list):
Get the URL from the user.
Load the web page in the IE UI control (embedded browser) in WinForms.
Allow the user to select a text (contiguous, small, not exceeding 50 chars) from the loaded web page.
When ...
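WinForms plumbing aside, the core step after the user selects text is usually to locate the containing element and derive a reusable locator for similarly structured pages. A sketch of just that step in Python with lxml (in C#, HtmlAgilityPack plays the same role); the URL and selected text are placeholders:

    import requests
    from lxml import html

    doc = html.fromstring(requests.get("http://example.com/page").content)
    selected = "text the user highlighted"                 # placeholder selection
    matches = doc.xpath("//*[contains(text(), $t)]", t=selected)
    if matches:
        # an absolute XPath to the element, reusable on pages with the same layout
        print(doc.getroottree().getpath(matches[0]))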
Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. I want to know: what would be a good DB to use as a backend for this project?
Thanks in advance!
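For this scale, one low-friction option is SQLite from Python's standard library: no server to run, and a URL primary key deduplicates re-crawled articles. A minimal sketch (the table layout is just an assumption):

    import sqlite3

    con = sqlite3.connect("articles.db")
    con.execute("""CREATE TABLE IF NOT EXISTS articles (
                       url        TEXT PRIMARY KEY,   -- dedupes re-crawled URLs
                       title      TEXT,
                       body       TEXT,
                       fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
    con.execute("INSERT OR IGNORE INTO articles (url, title, body) VALUES (?, ?, ?)",
                ("http://nytimes.com/example", "Example title", "Article text..."))
    con.commit()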
...
OK, so an exhaustive depth-first crawl is not an efficient way to visit all links. I am looking for a library or algorithm that can improve the efficiency of crawling relevant pages, i.e. skipping repetitive pages or pages with little unique content.
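The usual starting point is breadth-first crawling with a seen-URL set plus a content fingerprint, so exact-duplicate pages are skipped (catching near-duplicates takes something like shingling or simhash on top of this). A sketch where fetch() and extract_links() are placeholders for your own functions:

    import hashlib
    from collections import deque

    def crawl(seed, fetch, extract_links, limit=1000):
        frontier = deque([seed])                 # BFS queue instead of DFS recursion
        seen_urls, seen_content = {seed}, set()
        while frontier and limit > 0:
            url = frontier.popleft()
            body = fetch(url)
            digest = hashlib.sha1(body.encode()).hexdigest()
            if digest in seen_content:           # exact-duplicate page: skip it
                continue
            seen_content.add(digest)
            limit -= 1
            for link in extract_links(url, body):
                if link not in seen_urls:        # never enqueue a URL twice
                    seen_urls.add(link)
                    frontier.append(link)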
...
Hey,
I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET-MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the us...