I know it's a big question, but I'm a complete beginner. I have limited experience in HTML, PHP, etc., and want to knock something together but don't even know where to start.
Although I can't necessarily program in every language, with a little guidance I do a mean cut-and-paste and can learn anything. I'm a school teacher, so I have a long summ...
Hi!
One of our sites has non-ASCII (non-English) characters in its URLs:
http://example.com/kb/начало-работы/оплата
I wonder how web crawlers (particularly Googlebot) handle these situations. Do these URLs have to be encoded or otherwise processed?
...
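For reference, non-ASCII path segments are normally percent-encoded as UTF-8 (RFC 3986) before going on the wire, and browsers typically do this transparently; the encoded and decoded forms name the same resource. A quick Python sketch of what that encoding looks like, using a segment from the URL above:

```python
from urllib.parse import quote, unquote

# Percent-encode a Cyrillic path segment as UTF-8 (RFC 3986 style).
segment = "начало-работы"
encoded = quote(segment)
print(encoded)  # every Cyrillic letter becomes two %XX escapes

# Decoding restores the original text, so both forms name the same resource.
assert unquote(encoded) == segment
```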
What are some ways that websites can block web scrapers? And how can you tell whether your server is being accessed by a bot?
...
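One common (and easily fooled) signal is the User-Agent header; robots.txt only asks polite crawlers to stay away, and stronger measures include rate limiting, CAPTCHAs, and reverse-DNS checks. A minimal sketch, with the signature list purely illustrative:

```python
# Minimal User-Agent check; real bot detection also uses rate limits,
# reverse-DNS lookups, and behavioral signals. Signatures are illustrative.
BOT_SIGNATURES = ("googlebot", "bingbot", "crawler", "spider", "scrapy")

def looks_like_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0"))  # False
```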
Folks,
I need to accomplish some sophisticated web crawling.
The goal, in simple words: log in to a page, enter some values in some text fields, click Submit, then extract some values from the retrieved page.
What is the best approach?
Some unit-testing third-party lib?
Manual crawling in C#?
Maybe there is a ready lib for that specific...
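The flow itself (log in, submit a form, read the resulting page) is the same whatever library you pick. A minimal stdlib sketch in Python rather than C#, with the URLs and field names as hypothetical placeholders:

```python
import urllib.request
import urllib.parse
from http.cookiejar import CookieJar

# Sketch: POST a login form, keep the session cookie, then fetch a page.
# login_url, page_url, and the form field names are placeholders.
def login_and_fetch(login_url: str, page_url: str, form: dict) -> str:
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    data = urllib.parse.urlencode(form).encode("utf-8")
    opener.open(login_url, data)  # the server sets the session cookie here
    return opener.open(page_url).read().decode("utf-8", "replace")

# The form-encoding step on its own:
payload = urllib.parse.urlencode({"user": "alice", "pass": "secret"})
print(payload)  # user=alice&pass=secret
```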
On a website, there are many pages with a component for users to leave comments. To reduce page load time and since few users use the commenting system, the commenting component is loaded via AJAX after the page is loaded. The issue: how can we get Google to index dynamic content that is loaded via AJAX on page load?
Many other pages on...
Hey guys,
I am building a scraper which needs to scrape some web content. I am facing an issue: the page I need to crawl has loads of JavaScript, and it seems that the JavaScript calls are setting up some cookies and some query-string parameters for the next requests.
I am able to set the cookies by sending requests to the JS files, but see...
Possible Duplicate:
Finding and Printing all Links within a DIV
I'm trying to make a mini crawler. When I specify a site, it does file_get_contents(), then gets the data I want, which I've already done. Now I want to add code that enables it to find any external links on the site it is on and get the data ...
Basically, I...
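One way to pick out external links after fetching the page: parse the anchors, absolutize each href against the page URL, and keep those pointing at a different host. A Python sketch (the question uses PHP; the same idea maps to DOMDocument plus parse_url()):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Collect hrefs from anchors, then keep only links whose host differs
# from the page's own host (i.e. external links).
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def external_links(base_url: str, html: str) -> list:
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).netloc != base_host]

sample = '<a href="/about">in</a> <a href="http://other.example/x">out</a>'
print(external_links("http://mysite.example/", sample))
# ['http://other.example/x']
```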
I'm using HtmlUnit [see this], and I bumped into a weird problem:
I'm trying to call a page, click a button and retrieve the subsequent page.
It works fine, but sometimes it fails with an ElementNotFoundException when I try setting the value attribute of a field on the retrieved page.
I tried adding a Sleep(1000), but it doesn't help....
I have an application that downloads URLs in different threads using a thread pool, but recently I read an article (http://www.codeproject.com/KB/IP/Crawler.aspx) which says that HttpWebRequest.GetResponse() works in only one thread while the other threads wait for it. First, I want to know: is that true? How can I monitor ...
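Whether GetResponse() really serializes across threads is a .NET question, but the general pattern, a pool of workers each issuing its own blocking request, looks like this in Python; fetch() is a stand-in for a real HTTP GET (e.g. urllib.request.urlopen) so the example runs without a network:

```python
from concurrent.futures import ThreadPoolExecutor

# fetch() stands in for a real blocking HTTP GET so this runs offline.
def fetch(url: str) -> str:
    return "<html>body of %s</html>" % url

urls = ["http://a.example/", "http://b.example/", "http://c.example/"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # One task per URL; the workers run their blocking fetches in parallel.
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 3
```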
Hi all,
I am working on a web crawler, using the WebBrowser control for this purpose. I have a list of URLs stored in a database, and I want to traverse those URLs one by one and parse the HTML.
I used the following logic:
foreach (string href in hrefs)
{
webBrowser1.Url = new Uri(hre...
Hi Everyone,
I'm trying to extract BibTeX information from CiteSeer to build an XML file containing data related to my search query and the articles cited by the result set. I have already built a program which reads the source of the URL and extracts information.
But this mechanism is giving some errors in the results, where it doe...
Hey friends,
What does "indexing" mean? How is it useful to a web crawler?
...
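Roughly: indexing means building a lookup from terms to the documents that contain them, so that queries don't have to rescan every crawled page. A toy inverted index:

```python
from collections import defaultdict

# Toy inverted index: map each word to the set of page IDs containing it.
def build_index(pages: dict) -> dict:
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

pages = {1: "web crawlers fetch pages", 2: "an index maps words to pages"}
index = build_index(pages)
print(sorted(index["pages"]))  # [1, 2]
print(sorted(index["index"]))  # [2]
```

A crawler feeds fetched pages into such a structure; answering a query is then a set lookup instead of a full scan.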
I would like to write a Facebook fan-page crawler which crawls the following information: 1) fan page name, 2) fan count, 3) feeds.
I know I can use the Open Graph API to get this, but I want to write a script which will run once a day, fetch all this data, and dump it into my SQL DB.
Is there a better way to do this?
Any help is appreciated.
...
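A sketch of the request such a daily script would build; the endpoint shape and the name/fan_count field names are assumptions to check against the current Graph API docs, and the actual fetch and SQL insert are left out since they need a real access token and database:

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com"

# Build the Graph API URL for one page. Field names and endpoint shape
# are assumptions; verify against the current Graph API documentation.
def page_url(page_id: str, token: str) -> str:
    query = urlencode({"fields": "name,fan_count", "access_token": token})
    return "%s/%s?%s" % (GRAPH, page_id, query)

print(page_url("cocacola", "TOKEN"))

# A once-a-day run is usually just cron:  0 3 * * *  python crawl_pages.py
```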
So my brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library reference, but I have a few problems:
1. The httplib.HTTPConnection and request concept is new to me, and I don't understand whether it downloads an HTML script like a cookie or an instance. If y...
Lots of spiders/crawlers visit our news site. We depend on GeoIP services to identify our visitors' physical location and serve them related content. So we developed a module with module_init() function that sends IP to MaxMind and sets cookies with location information. To avoid sending requests on each page view, we first check wheth...
I am building a crawler that fetches information in parallel from a number of websites in real-time in response to a request for this information from a client. I need to request specific pages from 10-20 websites, parse their contents for specific snippets of information and return this information to the client as fast as possible. I w...
For example, if in my index.php I have something like:
<?php
header('Location: /mypublicsite/index.php');
?>
what do the crawlers and/or robots get? Just a blank page? Or do they actually arrive at /mypublicsite/index.php?
...
Is there any way in web development to ensure that web crawlers cannot crawl your website?
...
I am working on a crawler in PHP that expects m URLs, at each of which it finds a set of n links to n internal pages that are crawled for data. Links may be added to or removed from the set of n links. I need to keep track of the links/pages so that I know which have been crawled, which have been removed, and which are new.
How should I...
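If the link sets are kept per crawl run, the bookkeeping is plain set arithmetic; illustrated in Python with made-up URLs (in PHP the same idea is array_diff()/array_intersect()):

```python
# Compare the link set from the last crawl against the current one.
previous = {"/a", "/b", "/c"}   # links seen on the last crawl
current = {"/b", "/c", "/d"}    # links found this time

new_links = current - previous      # never seen before -> crawl them
removed_links = previous - current  # gone since last time -> mark removed
unchanged = current & previous      # still present

print(sorted(new_links))      # ['/d']
print(sorted(removed_links))  # ['/a']
print(sorted(unchanged))      # ['/b', '/c']
```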
Hi!
I have a URL.
How can I find all the existing sub-URLs of this page?
For example,
http://tut.by/car/12324 - exists
................/car/66666 - doesn't exist
Desirably in Java.
I have already experimented with almost everything from java-source.net/open-source/crawlers - none of them can do that; they can only follow hrefs.
Thanks in advance!
...
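Since URLs like /car/12324 aren't necessarily linked from anywhere, no href-following crawler can discover them; the only general approach is to probe candidate IDs and keep the ones the server answers for. A Python sketch (the question asks for Java, but the idea transfers directly) with an injectable exists() check so it runs offline; a real checker would issue a HEAD request and treat a 200 as "exists":

```python
# Probe candidate IDs under a base URL; exists() is injectable so this
# runs offline. A real checker would send a HEAD request per candidate.
def probe(base: str, candidates, exists) -> list:
    return [base + str(c) for c in candidates if exists(base + str(c))]

# Simulated server: only these two IDs respond.
known = {"http://tut.by/car/12324", "http://tut.by/car/12325"}
found = probe("http://tut.by/car/", range(12324, 12328), lambda u: u in known)
print(found)  # ['http://tut.by/car/12324', 'http://tut.by/car/12325']
```

Note that probing large ID ranges hammers the server, so a real run should be rate-limited.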