web-crawler

How can I make a web crawler gather data?

I know it's a big question, but I'm a complete beginner. I have limited experience in HTML, PHP, etc., and want to knock something together, but I don't even know where to start. Although I can't necessarily program in every language, with a little guidance I do a mean cut-and-paste and can learn anything. I'm a school teacher, so I have a long summ...
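For a first crawler, the core loop is small: fetch a page, pull out its links, and queue any you haven't visited yet. Below is a minimal sketch using only Python's standard library. The fetch function is passed in, so in real use you would supply one that wraps urllib.request.urlopen; the example.com URLs here are placeholders:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, collect its links, repeat.
    `fetch` is any function mapping a URL to an HTML string."""
    visited, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return pages
```

Because the fetcher is injectable, the loop can be exercised with a dictionary of fake pages before ever touching the network.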

Web crawlers and non-ASCII characters in sitemap.xml

Hi! One of our sites has non-ASCII (non-English) characters in URLs: http://example.com/kb/начало-работы/оплата I wonder how web crawlers (particularly Googlebot) handle these situations. Do these URLs have to be encoded or otherwise processed? ...
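As far as I know, Googlebot copes with raw UTF-8 URLs, but the safe convention for sitemap.xml is to percent-encode non-ASCII characters as UTF-8 bytes, per RFC 3986. A quick sketch in Python (used for illustration throughout these answers):

```python
from urllib.parse import quote, unquote

# Percent-encode the non-ASCII path segments as UTF-8 bytes.
path = "/kb/начало-работы/оплата"
encoded = quote(path)  # "/" is in the default safe set, so it survives

# Both forms name the same resource; decoding restores the original text.
assert unquote(encoded) == path
```

The encoded form is plain ASCII, so it is safe to paste into sitemap.xml regardless of the file's declared charset.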

Blocking Web Scrapers

What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot? ...
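One common first-pass check is the User-Agent header, since well-behaved bots identify themselves. A hypothetical sketch (the token list is made up, and note that impolite scrapers can spoof the header, so real defenses also rely on rate limiting, robots.txt, and CAPTCHAs):

```python
# Substrings that commonly appear in crawler User-Agent strings.
# This list is illustrative, not exhaustive.
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider", "slurp")

def looks_like_bot(user_agent):
    """Heuristic: does the User-Agent announce itself as a crawler?"""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)
```

On the server side you would call this with the incoming request's User-Agent header and log or throttle accordingly.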

Testing a website using C#

Folks, I need to accomplish some sophisticated web crawling. The goal, in simple words: log in to a page, enter some values in some text fields, click Submit, then extract some values from the retrieved page. What is the best approach? Some third-party unit-testing lib? Manual crawling in C#? Maybe there is a ready lib for that specific...
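The question is about C#, but the flow itself (POST the form fields, keep the session cookie, request the next page, then parse it) is the same everywhere. A sketch with Python's standard library, where the URL and field names are made up and would have to match the real login form:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical form fields; view the login page's HTML to find the real names.
form = {"username": "teacher", "password": "secret"}
data = urllib.parse.urlencode(form).encode("utf-8")

# An opener built around a CookieJar keeps the session cookie between requests.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar()))

login = urllib.request.Request("http://example.com/login", data=data)
# Supplying a body makes this a POST. opener.open(login) would submit it,
# and a later opener.open("http://example.com/report") reuses the cookies.
```

In C#, HttpWebRequest paired with a CookieContainer plays the same role as the CookieJar-backed opener here.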

Dynamic Content & SEO: Create 2 Separate Pages?

On a website, there are many pages with a component for users to leave comments. To reduce page load time and since few users use the commenting system, the commenting component is loaded via AJAX after the page is loaded. The issue: how can we get Google to index dynamic content that is loaded via AJAX on page load? Many other pages on...

Is there a way I can compile JavaScript using my C# code?

Hey guys, I am building a scraper which needs to scrape some web content. I am facing an issue: the page I need to crawl has loads of JavaScript, and it seems that the JavaScript calls are setting up some cookies and some query-string parameters for the next requests. I am able to set the cookies by sending requests to the JS files, but see...

php find external links and get data

Possible Duplicate: Finding and Printing all Links within a DIV. I'm trying to make a mini crawler. When I specify a site, it does file_get_contents(), then gets the data I want, which I've already done. Now I want to add code that enables it to find any external links on the site it is on and get the data. Basically, I...
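In Python terms (the question is PHP, but the idea maps over directly): collect every href, resolve it against the page URL, and keep only those whose host differs from the page's own:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class HrefCollector(HTMLParser):
    """Gathers raw href values from all <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

def external_links(page_url, html):
    """Return absolute URLs on the page whose host differs from page_url's."""
    collector = HrefCollector()
    collector.feed(html)
    here = urlparse(page_url).netloc
    out = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # make relative links absolute
        if urlparse(absolute).netloc not in ("", here):
            out.append(absolute)
    return out
```

The PHP equivalent would use DOMDocument plus parse_url for the same host comparison.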

HtmlUnit - ElementNotFound Exception

I'm using HtmlUnit [see this], and I bumped into a weird problem: I'm trying to call a page, click a button, and retrieve the subsequent page. It works fine, but sometimes it fails with an ElementNotFoundException when I try setting the value attribute for a field in the retrieved page. I tried adding a Sleep(1000), but it doesn't help....

Monitor which thread is downloading url

I have an application that downloads URLs using a thread pool on different threads, but recently I read an article (http://www.codeproject.com/KB/IP/Crawler.aspx) which says HttpWebRequest.GetResponse() works in only one thread while the other threads wait for it. First, I want to know: is this true? And how can I monitor ...
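Regarding the claim: multiple threads can each run their own HttpWebRequest, though if I remember correctly .NET limits concurrent connections per host by default (ServicePointManager.DefaultConnectionLimit), which can make extra requests queue and look serialized. To monitor which thread handles which URL, tag each result with the current thread's name. A Python sketch of the idea, where the download function is a stand-in that only simulates latency:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def download(url):
    """Stand-in for the real HTTP fetch; records which worker ran it."""
    time.sleep(0.05)  # simulated network latency keeps workers busy
    return url, threading.current_thread().name

urls = ["http://example.com/page%d" % i for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download, urls))

threads_used = {name for _, name in results}
# If every request really ran on one thread, this set would have size 1.
```

In C#, the analogue is logging Thread.CurrentThread.ManagedThreadId alongside each completed request.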

C# Webbrowser Control: Navigating a List of URLs

Hi all, I am working on a web crawler. I am using the WebBrowser control for this purpose. I have the list of URLs stored in a database, and I want to traverse all those URLs one by one and parse the HTML. I used the following logic: foreach (string href in hrefs) { webBrowser1.Url = new Uri(hre...

Need to extract bibtex from Citeseer or any academic article search engine/database

Hi everyone, I'm trying to extract BibTeX information from CiteSeer to build an XML file containing data related to my search query and the cited articles in the result set. I have already built a program which reads the source of the URL and extracts information. But this mechanism is giving some errors in the results, where it doe...

What Does "Index" Mean in a DB?

Hey friends, what does "indexing" mean? How is it useful to a web crawler? ...

Facebook fanpage crawler

I would like to write a Facebook fan page crawler which crawls the following information: 1) fan page name, 2) fan count, 3) feeds. I know I can use the Open Graph API to get this, but I want to write a script which will run once a day, get all this data, and dump it in my SQL DB. Is there a better way to do this? Any help is appreciated ...

Python Web Crawlers and "getting" html source code

So my brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems. 1. The httplib.HTTPConnection and request concept is new to me, and I don't understand if it downloads an HTML script like a cookie or an instance. If y...
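On point 1: HTTPConnection doesn't download "a cookie or an instance". request() sends the HTTP request, getresponse() returns a response object, and that object's read() gives you the raw HTML bytes of the page. The sketch below uses Python 3, where httplib was renamed http.client (the calls are the same in 2.7), and runs against a throwaway local server so it works without a network:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    """Tiny server that returns a fixed HTML page for any GET."""
    def do_GET(self):
        body = b"<html><body>hi</body></html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Hello)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The connection / request / response dance the question asks about:
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/")          # send the request
response = conn.getresponse()     # read the status line and headers
html = response.read()            # the raw page source, as bytes
conn.close()
server.shutdown()
server.server_close()
```

In real use you would point HTTPConnection at the target host instead of the local server, then feed `html` to a parser.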

Detect spiders or browsers with cookies enabled

Lots of spiders/crawlers visit our news site. We depend on GeoIP services to identify our visitors' physical location and serve them related content. So we developed a module with a module_init() function that sends the IP to MaxMind and sets cookies with location information. To avoid sending requests on each page view, we first check wheth...

Good library/platform for a real-time/parallel HTTP crawler?

I am building a crawler that fetches information in parallel from a number of websites in real-time in response to a request for this information from a client. I need to request specific pages from 10-20 websites, parse their contents for specific snippets of information and return this information to the client as fast as possible. I w...
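With only 10-20 pages per request, a thread pool that fetches in parallel and hands back each result the moment it completes may be all you need; heavier options include async I/O (e.g. asyncio) or a crawling framework such as Scrapy. A self-contained sketch with the fetch-and-parse step stubbed out:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_snippet(url):
    """Stand-in for: download url, parse it, return the wanted snippet."""
    time.sleep(0.01)  # simulated network latency
    return url, "snippet from " + url

urls = ["http://site%d.example/page" % i for i in range(10)]
results = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch_snippet, u) for u in urls]
    for future in as_completed(futures):  # yields in completion order
        url, snippet = future.result()
        results[url] = snippet            # could stream to the client here
```

The as_completed loop is what makes this "real-time": fast sites are delivered immediately instead of waiting for the slowest one.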

does a PHP redirection affects the way a crawler or a robot views a website?

For example, if in my index.php I have something like: <?php header('Location: /mypublicsite/index.php'); ?> what do the crawlers and/or robots get? Just a blank page? Or do they actually arrive at /mypublicsite/index.php? ...

keep web crawlers out of your site

Is there any way in web development to ensure that web crawlers cannot crawl your website? ...
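You can't strictly ensure it: robots.txt only asks crawlers to stay away. Polite crawlers (Googlebot and friends) obey it, hostile scrapers ignore it, so hard guarantees require server-side measures such as authentication or rate limiting. The sketch below shows how a compliant crawler evaluates a deny-all robots.txt, using Python's standard urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay out of the entire site.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler checks before fetching:
allowed = parser.can_fetch("Googlebot", "http://example.com/private")
# allowed is False, so a polite bot skips the URL
```

Serving that two-line file at the site root keeps compliant bots out; anything stronger has to happen in the web server or application itself.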

Crawler Coding: determine if pages have been crawled?

I am working on a crawler in PHP that expects m URLs, at each of which it finds a set of n links to n internal pages which are crawled for data. Links may be added to or removed from the set of n links. I need to keep track of the links/pages so that I know which have been crawled, which ones have been removed, and which ones are new. How should I...
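One way to track this is to persist the set of URLs seen on the previous crawl (e.g. a table keyed by URL with a last-crawled timestamp) and diff it against the links found this time. Plain set operations give the three groups; sketched in Python, though PHP's array functions work the same way:

```python
def diff_links(previously_seen, current_links):
    """Classify this crawl's links against the last crawl's.
    Returns (new, removed, still_present) as sets of URLs."""
    current = set(current_links)
    new = current - previously_seen           # never seen before: crawl these
    removed = previously_seen - current       # gone since last time: prune
    still_present = current & previously_seen # unchanged membership
    return new, removed, still_present
```

After each run, write `current` back to storage so it becomes `previously_seen` for the next run.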

java to know all subUrls of url

Hi! I have a URL. How can I find all the existing sub-URLs of this page? For example, http://tut.by/car/12324 exists, while ................/car/66666 doesn't. Desirably in Java. I have already experimented with almost all of the crawlers from java-source.net/open-source/crawlers; none of them can do that, they can only follow hrefs. Thanks in advance! ...