web-crawler

What do you call a spidering technique where the spider visits all links on the first level, then all links on the second level?

I forgot the name for the technique where a web spider first visits all links it sees on the first level, then visits all links it sees on the second level, and so on... there is a name for this technique... I forgot... Anyway, this is very exhaustive and obviously inefficient. Is there a better way? I remember reading a paper in su...
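
The technique being described is breadth-first search (BFS) crawling: the frontier is a FIFO queue, so every depth-1 link is fetched before any depth-2 link. A minimal sketch in Python, where fetch and extract_links are assumed caller-supplied helpers:

    from collections import deque

    def bfs_crawl(seed_url, fetch, extract_links, max_depth=2):
        """Visit all links at depth N before any link at depth N+1."""
        seen = {seed_url}
        queue = deque([(seed_url, 0)])        # FIFO queue => breadth-first
        while queue:
            url, depth = queue.popleft()
            page = fetch(url)                 # assumed downloader helper
            if depth < max_depth:
                for link in extract_links(page):   # assumed parser helper
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))

Swapping the FIFO queue for a priority queue ordered by a relevance estimate turns this into best-first ("focused") crawling, which is the usual answer to "is there a better way".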

How to allow crawlers access to index.php only, using robots.txt?

If I want to only allow crawlers to access index.php, will this work? User-agent: * Disallow: / Allow: /index.php ...
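
Laid out on separate lines, the proposed file is shown below. One caveat: Allow is an extension (popularized by Google) rather than part of the original robots.txt standard, so a crawler that doesn't implement it will see only Disallow: / and skip the site entirely; putting the Allow line first is the conventionally safer order for parsers that stop at the first matching rule.

    User-agent: *
    Allow: /index.php
    Disallow: /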

In which programming language is the Googlebot written (or any other efficient web-crawler)?

Does anyone know in which programming language the Googlebot was written? Or, more generally, in which languages are efficient web-crawlers written? I've seen many written in Java, but it doesn't seem to me the most appropriate language for developing a web-crawler because it creates far too much overhead (tried with the Heritrix web-crawler, ...

Where can I find papers on web spiders and AI?

I'm interested in finding algorithms or approaches for developing spiders which follow some AI or crawling model highlighted in computer science papers. Where can I find such papers? ...

Which web crawler for extracting and parsing data from about a thousand web sites?

I'm trying to crawl about a thousand web sites, and I'm interested in the HTML content only. Then I transform the HTML into XML to be parsed with XPath to extract the specific content I'm interested in. I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory and stability problems (Herit...
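
If the end goal is XPath extraction, one possible simplification: lxml's HTML parser tolerates real-world malformed HTML and exposes XPath directly, which may make the separate HTML-to-XML conversion step unnecessary. A minimal sketch (the XPath expression in the usage comment is a placeholder):

    import lxml.html

    def extract(html_text, xpath_expr):
        """Parse possibly malformed HTML and evaluate an XPath expression."""
        tree = lxml.html.fromstring(html_text)
        return tree.xpath(xpath_expr)

    # hypothetical usage:
    # titles = extract(page_source, "//h1/text()")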

Bot for web quality checks

I am looking for a good open-source bot to determine some quality metrics often required for Google indexing. For example: find duplicate titles, invalid links (JSpider does this, and I think a lot more will), exactly the same page but different URLs, etc., where etc. equals Google's quality requirements. ...
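
If no existing bot fits, the duplicate-title and duplicate-page checks reduce to grouping by hash; a rough sketch, assuming the pages have already been fetched into a dict mapping URL to HTML:

    import hashlib
    import re
    from collections import defaultdict

    def find_duplicates(pages):
        """Group URLs by <title> text and by a hash of the full body."""
        by_title = defaultdict(list)
        by_body = defaultdict(list)
        for url, html in pages.items():
            m = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
            if m:
                by_title[m.group(1).strip()].append(url)
            by_body[hashlib.md5(html.encode()).hexdigest()].append(url)
        dup_titles = {t: urls for t, urls in by_title.items() if len(urls) > 1}
        same_page_diff_url = [urls for urls in by_body.values() if len(urls) > 1]
        return dup_titles, same_page_diff_url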

Why does Googlebot traverse a newly added site in ascending order of URL length?

Googlebot (Googlebot/2.1) appears to crawl URLs on a newly added site in an order corresponding to the length of the URL: .. GET /ivjwiej/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ .. .. GET /voeoovo/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ .. .. GET /zeooviee/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googl...

How to generate a graphical sitemap of a large website

Hello, I would like to generate a graphical sitemap for my website. There are two stages, as far as I can tell: crawl the website and analyse the link relationships to extract the tree structure; then generate a visually pleasing render of that tree. Does anyone have advice or experience with achieving this, or know of existing work I can ...
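
For the second stage, one workable stack is networkx for the graph plus Graphviz (via pydot) for the drawing; a sketch, assuming the crawl stage has already produced a list of (parent, child) link pairs:

    import networkx as nx

    def render_sitemap(edges, out_file="sitemap.png"):
        """Render a crawled link tree with Graphviz via networkx/pydot."""
        g = nx.DiGraph()
        g.add_edges_from(edges)        # edges = [(parent_url, child_url), ...]
        dot = nx.nx_pydot.to_pydot(g)  # requires pydot and Graphviz installed
        dot.set_rankdir("LR")          # left-to-right tree layout
        dot.write_png(out_file)

    # hypothetical usage:
    # render_sitemap([("/", "/about"), ("/", "/blog"), ("/blog", "/post-1")])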

Not crawling the same content twice

I'm building a small application that will crawl sites where the content is growing (like on Stack Overflow); the difference is that the content, once created, is rarely modified. Now, in the first pass I crawl all the pages on the site. But on the next pass, for the paged content of that site, I don't want to re-crawl all of it, just the latest addit...
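
One common pattern: fingerprint every item already stored, then on later passes walk the paginated listing newest-first and stop as soon as a page contains nothing unseen. A sketch with hypothetical fetch_page and extract_items helpers:

    import hashlib

    def crawl_new_items(list_url, fetch_page, extract_items, seen_hashes):
        """Walk newest-first paginated listings, stopping at known content."""
        page_no = 1
        while True:
            items = extract_items(fetch_page(list_url, page_no))
            new = [i for i in items
                   if hashlib.sha1(i.encode()).hexdigest() not in seen_hashes]
            for item in new:
                seen_hashes.add(hashlib.sha1(item.encode()).hexdigest())
                yield item
            if not items or len(new) < len(items):
                return        # reached content from a previous pass
            page_no += 1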

Crawling news articles

Does anyone know if there are standards/APIs for crawling news articles from most of the biggest news sources? I'm using RSS to index them, but I would like to classify them with more data than just their titles. ...
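
There is no single cross-publisher standard beyond the feeds themselves, but RSS/Atom entries usually carry more than titles, and the feedparser package normalizes those fields across feed dialects; a sketch with a placeholder feed URL:

    import feedparser

    feed = feedparser.parse("https://example.com/rss")   # placeholder URL
    for entry in feed.entries:
        print(entry.get("title"))
        print(entry.get("published"))          # not every feed sets this
        print(entry.get("summary", ""))
        # many news feeds attach category tags usable for classification
        print([t["term"] for t in entry.get("tags", [])])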

Most optimized way to store crawler state?

Hi there, I'm currently writing a web crawler (using the Python framework Scrapy). Recently I had to implement a pause/resume system. The solution I implemented is of the simplest kind: basically, it stores links when they get scheduled, and marks them as 'processed' once they actually are. Thus, I'm able to fetch those links (obviousl...
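
For what it's worth, newer Scrapy releases ship this persistence built in: starting the crawl with a JOBDIR setting makes Scrapy serialize the pending request queue and the seen-request fingerprints to that directory, so stopping the spider and re-running the same command resumes where it left off (the spider name and directory below are placeholders):

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1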

Is there a list of known web crawlers?

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents and some are clearly bots or web crawlers, but for many I'm not sure; they may or may not be web crawlers, and they are causing many downloads, so it's important for me to know. Is there somewhere a list of known web crawlers with so...
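
Community-maintained databases of crawler user agents exist (robotstxt.org keeps one, and sites such as user-agents.org publish downloadable lists). Once you have such a list, classifying a log line is a substring check; a sketch with a tiny hand-rolled stand-in list:

    # Tiny stand-in list; a real one would come from a maintained database.
    KNOWN_BOT_FRAGMENTS = ["googlebot", "bingbot", "slurp", "baiduspider",
                           "yandexbot", "crawler", "spider", "bot"]

    def looks_like_bot(user_agent):
        ua = user_agent.lower()
        return any(frag in ua for frag in KNOWN_BOT_FRAGMENTS)

    print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True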

What is the best way to crawl login-based sites?

I have to automate a file download activity from a website (similar to, let's say, yahoomail.com). To reach the page which has this file download link, I have to log in, jump from page to page to provide some parameters like dates etc., and finally click on the download link. I am thinking of three approaches: using WatiN and developing a Windows s...
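
If the site works without JavaScript, a fourth approach worth comparing against those three is a plain HTTP client that persists cookies between requests; a sketch using Python's requests library, where all URLs and form-field names are made up for illustration:

    import requests

    session = requests.Session()        # keeps cookies across requests

    # 1. log in (the form fields are hypothetical)
    session.post("https://example.com/login",
                 data={"username": "me", "password": "secret"})

    # 2. navigate, passing parameters such as dates
    session.get("https://example.com/reports", params={"from": "2010-01-01"})

    # 3. fetch the file behind the download link
    resp = session.get("https://example.com/reports/export.csv")
    with open("export.csv", "wb") as f:
        f.write(resp.content)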

How do web crawlers handle JavaScript?

Today a lot of content on the Internet is generated using JavaScript (specifically by background AJAX calls). I was wondering how web crawlers like Google's handle it. Are they aware of JavaScript? Do they have a built-in JavaScript engine? Or do they simply ignore all JavaScript-generated content on the page (which I guess is quite unlikely)? Do peo...

Is a web crawler in Java or with J2EE good as my final year project? How can I improve a search engine via a web crawler? What other features do the best crawlers have?

This is for my final year project. I want to design a web crawler that crawls the pages from a set of URLs I have given. I want to know: how can I improve a search engine via a web crawler? I chose Java as my development language. Is that good? What other features do web crawlers have? What are the other ways they can help us? What ...

Outgoing load balancer

I have a big threaded feed retrieval script in Python. My question is, how can I load-balance outgoing requests so that I don't hit any one host too often? This is a big problem for FeedBurner, since a large percentage of sites proxy their RSS through FeedBurner, and to further complicate matters many sites will alias a subdomain on the...
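
One way to pin the requirement down is "no more than one request per host every N seconds", which a per-host timestamp table can enforce across threads; a sketch (collapsing FeedBurner aliases back to one logical host is left out):

    import threading
    import time
    from urllib.parse import urlparse

    class HostThrottle:
        """Block until min_interval seconds have passed per host."""
        def __init__(self, min_interval=5.0):
            self.min_interval = min_interval
            self.last_hit = {}               # host -> monotonic timestamp
            self.lock = threading.Lock()

        def wait(self, url):
            host = urlparse(url).netloc
            while True:
                with self.lock:
                    now = time.monotonic()
                    ready_at = self.last_hit.get(host, 0) + self.min_interval
                    if now >= ready_at:
                        self.last_hit[host] = now
                        return
                time.sleep(ready_at - now)   # sleep outside the lock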

Where do I put the robots.txt file?

I have the domain www.mydomain.com and I set up Apache mod_rewrite so as to have www.mydomain.com/myappl. Where should I place the robots.txt file? Thanks! ...
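
Crawlers only ever request robots.txt from the root of the host, never from a subdirectory, so the file must be served at www.mydomain.com/robots.txt (in practice, placed in the Apache DocumentRoot); a robots.txt under /myappl will be ignored. A quick sanity check with Python's standard-library parser:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.mydomain.com/robots.txt")   # root of the host
    rp.read()                                          # fetches and parses it
    print(rp.can_fetch("*", "http://www.mydomain.com/myappl/"))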

Creating a web indexer in Java?

I'm supposed to write a web crawler in Java. The crawling part is easy, but the indexing part is difficult. I need to be able to query the indexer and have it return matches (multiple word queries). What would be the best data structure for doing such a thing? ...
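
The textbook structure for this is an inverted index: a map from each term to the set of documents containing it, with multiple-word queries answered by intersecting the posting sets. The project is in Java, but a short Python sketch shows the shape of the structure:

    from collections import defaultdict

    class InvertedIndex:
        """Map each term to the set of document ids containing it."""
        def __init__(self):
            self.postings = defaultdict(set)

        def add(self, doc_id, text):
            for term in text.lower().split():
                self.postings[term].add(doc_id)

        def query(self, *terms):
            """AND query: documents containing every term."""
            sets = [self.postings.get(t.lower(), set()) for t in terms]
            return set.intersection(*sets) if sets else set()

    # hypothetical usage:
    # idx.add("page1", "java web crawler"); idx.query("web", "crawler")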

Writing a pseudo-crawler for web statistics

I'm tasked with writing a web pseudo-crawler to calculate certain statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE against the number of HTML files that do not have it, and compare this statistic between sites on different subjects. To do so, the idea is to search Google for different terms (like "A...
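
The per-page test itself is small; a sketch using only the standard library, assuming the list of result URLs has already been collected (stripping leading whitespace and lowercasing avoids the usual false negatives):

    import urllib.request

    def starts_with_doctype(url):
        """True if the page body begins with a <!DOCTYPE declaration."""
        with urllib.request.urlopen(url) as resp:
            head = resp.read(512).decode("utf-8", errors="replace")
        return head.lstrip().lower().startswith("<!doctype")

    def doctype_ratio(urls):
        hits = sum(starts_with_doctype(u) for u in urls)
        return hits / len(urls) if urls else 0.0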

Getting a list of all churches in a certain state using Python

Hi, I am pretty good with Python, so pseudo-code will suffice when details are trivial. Please get me started on the task: how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into City, State, Street,...
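
For the extraction half, a rough US-address regular expression applied to fetched pages is enough to get started; a sketch (the pattern is deliberately loose and will need tuning against real pages):

    import re

    # Rough US street-address shape: number, street words, state code, ZIP.
    ADDRESS_RE = re.compile(r"\d+\s+[A-Za-z0-9#.\s]+?\s+[A-Z]{2}\s+\d{5}")

    def find_addresses(html_text):
        """Return address-looking strings from a page's text."""
        return ADDRESS_RE.findall(html_text)

    # e.g. find_addresses("... 123 Old West Road #3 Old Lyme City MD 01234 ...")
    # -> ["123 Old West Road #3 Old Lyme City MD 01234"]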