web-crawler

Crawler to get search results from an external website

What is the best practice, and which library can I use, to type into the search textbox on an external website and collect the search results? How do I tackle websites with different search boxes and checkboxes and collect the results? Can Selenium be used to automate this? Should I use Heritrix or Nutch? Which one is better? I heard Nutch comes with plugins. Whi...
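As a sketch of the Selenium route (Python bindings): the URL, the input name "q", and the result selector below are placeholder assumptions; every site's search form differs, so you would inspect the real markup first.

```python
# Sketch: drive an external site's search box with Selenium.
# "q" and "div.result a" are hypothetical; replace with the site's real markup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
try:
    driver.get("http://example.com")
    box = driver.find_element(By.NAME, "q")   # the site's search input
    box.send_keys("my query")
    box.send_keys(Keys.RETURN)                # submit the form
    for result in driver.find_elements(By.CSS_SELECTOR, "div.result a"):
        print(result.text, result.get_attribute("href"))
finally:
    driver.quit()
```

Heritrix and Nutch, by contrast, are bulk crawlers; they fetch pages at scale but do not fill in forms, which is why Selenium (or a similar browser driver) is the usual answer for search-box interaction.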

Rotating Proxies for web scraping

I've got a Python web crawler and I want to distribute the download requests among many different proxy servers, probably running Squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and eventually looping back around. Any idea how to set this up?...
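A minimal round-robin sketch using the requests library and itertools.cycle; the proxy addresses are placeholders for your own Squid instances:

```python
# Sketch: rotate requests across proxies round-robin.
import itertools
import requests

PROXIES = itertools.cycle([
    "http://proxy1.example.com:3128",
    "http://proxy2.example.com:3128",
    "http://proxy3.example.com:3128",
])

def fetch(url):
    proxy = next(PROXIES)  # request1 -> proxy1, request2 -> proxy2, ...
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

for url in ["http://example.com/a", "http://example.com/b"]:
    print(url, fetch(url).status_code)
```

Since the cycle lives in one place, swapping round-robin for random choice or a health-checked pool only changes how the next proxy is picked.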

How to crawl billions of pages?

Is it possible to crawl billions of pages on a single server? ...

Crawling engine architecture - Java/Perl integration

Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by sysadmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can...

Retrieving HTML pages from a 3rd-party login website with ASP.NET

Our situation: our team needs to retrieve log information from a 3rd-party website (specifically, call logs -- our client rents an 866 number; when calls come in, they assist people and need to make notes in our application corresponding to the current call). Our client has a web account with ...

Why doesn't Googlebot index pages it crawls?

Three months ago I published my small personal website (~10 pages), submitted the URL to Google, and a few days later Googlebot showed up. Over the course of the last couple of weeks, Googlebot visits my website approximately twice a week and crawls maybe every other page. Ever since Googlebot first crawled my website, whenever I run a...

Is there software available that will automatically categorize a website?

I'm looking for software (preferably a free .NET library) that will accept a URL and determine its genre/purpose automatically. E.g. www.cnn.com = news, www.google.com = search engine. I would imagine something like that exists, and functions by either scraping the site and analyzing its content, or by simply comparing it to a master lis...
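To illustrate the content-analysis approach the asker imagines, here is a deliberately naive Python sketch that fetches a page and counts keyword hits per genre; the keyword lists are invented assumptions, and a real classifier would use a trained model rather than hand-picked words.

```python
# Naive sketch: categorize a URL by counting genre keywords in its HTML.
import re
from urllib.request import urlopen

CATEGORIES = {
    "news":          ["breaking", "headline", "reporter", "editor"],
    "search engine": ["search", "query", "results", "index"],
    "shopping":      ["cart", "checkout", "price", "shipping"],
}

def categorize(url):
    text = urlopen(url).read().decode("utf-8", errors="ignore").lower()
    scores = {cat: sum(len(re.findall(r"\b%s\b" % kw, text)) for kw in kws)
              for cat, kws in CATEGORIES.items()}
    return max(scores, key=scores.get)  # genre with the most keyword hits

print(categorize("http://www.cnn.com"))  # likely "news"
```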

How to exclude part of a web page from Google's indexing?

There's a way of excluding complete pages from Google's indexing. But is there a way to specifically exclude certain parts of a web page from Google's crawling? For example, excluding the sidebar, which usually contains unrelated content? ...

Facebook crawler?

Hi all, I'm a graduate student whose research is in complex networks. I am working on a project that involves analyzing connections between Facebook users. Is it possible to write a crawler for Facebook based on friendship information? I looked around but couldn't find anything useful so far. It seems Facebook isn't fond of such activit...
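Mechanically, such a crawl is just breadth-first search over the friendship graph. A Python sketch follows; get_friends() is a hypothetical stand-in for however friend lists are obtained (an API, exported data, etc.), and Facebook's terms restrict scraping, so that question has to be settled first.

```python
# Sketch: BFS over a friendship graph, collecting edges for network analysis.
from collections import deque

def get_friends(user_id):
    """Hypothetical: return the list of friend IDs for user_id."""
    raise NotImplementedError

def crawl_network(seed, max_users=1000):
    seen, edges = {seed}, []
    queue = deque([seed])
    while queue and len(seen) < max_users:
        user = queue.popleft()
        for friend in get_friends(user):
            edges.append((user, friend))   # record the connection
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return edges
```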

Why should Ruby not be used to create a spider?

In Episode 78 of the Joel & Jeff podcast one of the Doctype / Litmus guys states that you would never want to build a spider in ruby. Would anyone like to guess at his reasoning for this? ...

Is it possible to programmatically log in to a website with C#?

Is it possible to write a C# program that will load up a webpage, pass the webform parameters to log in, then click on a link and download the page information? Obviously, I'd be supplying the username and password. In context, let's say I want to check if there are any new news updates on my school account, which I must log in to with m...
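Yes; the flow is: POST the login form, keep the session cookies, then fetch the protected page. A Python sketch of that flow (in C# the equivalent is HttpWebRequest with a CookieContainer); the URLs and the "username"/"password" field names are placeholders you'd read off the real login form:

```python
# Sketch: log in, persist cookies, fetch an authenticated page.
import requests

with requests.Session() as s:  # Session carries cookies across requests
    s.post("https://school.example.com/login",
           data={"username": "me", "password": "secret"})
    page = s.get("https://school.example.com/news")  # now authenticated
    if "New announcement" in page.text:
        print("There are new updates")
```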

Legality, terms of service for performing a web crawl

I was going to crawl a site for some research I was collecting. But, apparently, the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do? Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause? Restriction...

Too aggressive bot?

Hi, I'm making a little bot to crawl a few websites. I'm just testing it out right now, and I tried two settings: about 10 requests every 3 seconds -- the IP got banned, so I said, OK, that's too fast. 2 requests every 3 seconds -- the IP got banned after 30 minutes and 1000+ links crawled. Is that still too fast? I ...
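A common baseline is one request every few seconds per host, and honoring any Crawl-delay the site publishes in robots.txt. A minimal per-host throttle sketch in Python (the 5-second delay is an illustrative assumption, not a universal rule):

```python
# Sketch: enforce a minimum delay between requests to the same host.
import time

class Throttle:
    def __init__(self, delay=5.0):
        self.delay = delay   # seconds between hits to one host
        self.last = {}       # host -> time of last request

    def wait(self, host):
        elapsed = time.time() - self.last.get(host, 0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last[host] = time.time()

throttle = Throttle(delay=5.0)
throttle.wait("example.com")  # call before each request to that host
```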

Guide on crawling the entire web?

I just had this thought, and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (e.g. Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps). I've come across a paper where this was done... but I cannot recall the paper's title. It was about crawling the entire web on a single dedicated...
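A back-of-envelope check against the quoted hardware suggests why this is hard. Assuming an illustrative 1 billion pages at ~20 KB each (both numbers are assumptions, not measurements):

```python
# Rough feasibility estimate for the hardware above.
pages          = 1_000_000_000   # assumed crawl size
page_kb        = 20              # assumed average page size
bandwidth_mbps = 100

bytes_total = pages * page_kb * 1024
seconds     = bytes_total * 8 / (bandwidth_mbps * 1_000_000)
print("download time: %.0f days" % (seconds / 86400))    # ~19 days at full line rate
print("storage needed: %.0f GB" % (bytes_total / 1e9))   # ~20,000 GB vs. a 750 GB disk
```

So bandwidth alone is survivable, but raw storage (and the URL frontier, deduplication, and politeness bookkeeping) is where a single box runs out of room.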

Does any open, easily extensible web crawler exist?

I am searching for a web crawler solution which is mature enough and can be easily extended. I am interested in the following features... or the possibility to extend the crawler to meet them: partly just to read the feeds of several sites; to scrape the content of these sites; if the site has an archive, I would like to crawl and index it as we...
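One mature, extensible option is Scrapy (Python), where spiders, middlewares, and item pipelines are all pluggable. A minimal spider sketch; the site URL and CSS selectors are placeholders:

```python
# Sketch: a minimal Scrapy spider that walks an archive and yields articles.
import scrapy

class FeedSpider(scrapy.Spider):
    name = "feeds"
    start_urls = ["http://example.com/archive"]

    def parse(self, response):
        # follow each article link found on the archive page
        for href in response.css("a.article::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}
```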

How can I get MediaWiki to ignore page views from a Google Search Appliance?

The page view counter on each MediaWiki page seems like a great way to identify popular pages which are worth putting more effort into keeping up-to-date and useful, but I've hit a problem. We use a Google Search Appliance to index our MediaWiki installation. The problem I have is that the GSA increments the page view counter each time ...

Logic for Implementing a Dynamic Web Scraper in C#

I am looking to develop a web scraper (in C# Windows Forms). The whole idea I am trying to accomplish is as follows. Get the URL from the user. Load the web page in the IE UI control (embedded browser) in WinForms. Allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page. When ...
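The question is C#/WinForms, but the core step -- turning a user-selected string into a reusable locator such as an XPath -- looks the same in any DOM library. A Python/lxml sketch (the HTML and the selected text are made-up examples):

```python
# Sketch: find the element containing the user's selection and record its XPath.
from lxml import etree

html = "<html><body><div><p>price: <b>42 USD</b></p></div></body></html>"
selected = "42 USD"   # the text the user highlighted

tree = etree.HTML(html)
for el in tree.iter():
    if el.text and selected in el.text:
        xpath = tree.getroottree().getpath(el)   # e.g. /html/body/div/p/b
        print("store this XPath for future scrapes:", xpath)
        break
```

Storing the XPath rather than the text itself is what lets later crawls re-extract the same region even after the value changes.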

Database for a web crawler in Python?

Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. I want to know what would be a good DB to use as a backend for this project. Thanks in advance! ...
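For a single-machine crawler, the standard library's sqlite3 is a reasonable starting point; a minimal sketch (swap in PostgreSQL/MySQL if you later need concurrent writers or full-text search at scale):

```python
# Sketch: store crawled articles in SQLite, deduplicating by URL.
import sqlite3

conn = sqlite3.connect("articles.db")
conn.execute("""CREATE TABLE IF NOT EXISTS articles (
                    url        TEXT PRIMARY KEY,  -- also deduplicates fetches
                    title      TEXT,
                    body       TEXT,
                    fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save_article(url, title, body):
    conn.execute("INSERT OR IGNORE INTO articles (url, title, body) VALUES (?, ?, ?)",
                 (url, title, body))
    conn.commit()

save_article("http://nytimes.com/example", "Example headline", "Article text...")
```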

Any web crawling algorithm or library that is able to get relevant pages and ignore noise?

Okay, so an exhaustive depth-first crawl that visits all links is not efficient. I am looking for a library or algorithm that can improve the efficiency of crawling relevant pages, ignoring anything repetitive or pages with little unique content. ...
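This is usually called focused (best-first) crawling: score each discovered link for relevance and always expand the highest-scoring one next. A Python sketch; score() is a naive keyword-count placeholder (real focused crawlers use trained classifiers), and fetch/extract_links are caller-supplied functions:

```python
# Sketch: best-first crawl using a priority queue ordered by relevance score.
import heapq

def score(text, topic_words=("python", "crawler")):
    text = text.lower()
    return sum(text.count(w) for w in topic_words)

def focused_crawl(seed, fetch, extract_links, budget=100):
    seen, frontier = {seed}, [(0, seed)]       # min-heap; scores stored negated
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)       # most relevant URL so far
        page = fetch(url)                      # your HTTP fetch function
        budget -= 1
        for link, anchor in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(anchor), link))
```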

Prevent bots from crawling certain areas of a site

Hey, I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the us...
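For compliant bots, the standard mechanism is robots.txt Disallow rules (e.g. "User-agent: *" followed by "Disallow: /account/"; the path is a placeholder). A Python sketch of how a well-behaved spider checks them, using the standard library:

```python
# Sketch: how compliant crawlers consult robots.txt before fetching a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()  # fetch and parse the rules
print(rp.can_fetch("*", "http://example.com/account/inbox"))  # False if disallowed
print(rp.can_fetch("*", "http://example.com/about"))          # True
```

Note that robots.txt is advisory: badly behaved bots ignore it, so anything truly user-private should sit behind authentication rather than rely on exclusion rules.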