Hi,
I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search turns up a whole bunch of Java libraries for building a web crawler. Besides that, Nutch is of course a very robust package, but it seems a bit too advanced for my needs. I only need to crawl a handful of websites a week containing a couple of ...
Our site is developed in ASP.NET. We want to block the Default.aspx page from Google and other search engines. How can we "close" Default.aspx so that it is not accessible?
Or is there another way to solve the problem so that we don't create duplicate content?
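One simple option is a robots.txt rule at the site root; a minimal sketch (note that robots.txt only stops crawling — for a duplicate-content problem, a 301 redirect from /Default.aspx to / or a canonical link tag is usually the stronger fix):

```
User-agent: *
Disallow: /Default.aspx
```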
...
I want to write a crawler for screen scraping.
I want to get the price of a particular hotel from a website, like here:
website
e.g. At the above URL, there is a list of hotels and their prices. I want to get the price of The Beaufort.
Please advise how to accomplish this.
Thanks
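Without knowing the real page's markup, here is a minimal sketch of the usual approach with BeautifulSoup — the HTML below is hypothetical stand-in markup, and the class names and selectors are assumptions you would replace after inspecting the actual hotel-list page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the hotel-list page; the real
# site's structure and class names will differ, so adjust the selectors.
html = """
<table>
  <tr class="hotel"><td class="name">The Beaufort</td><td class="price">SGD 320</td></tr>
  <tr class="hotel"><td class="name">Other Hotel</td><td class="price">SGD 210</td></tr>
</table>
"""

def price_for(hotel_name, markup):
    """Return the price cell text for the named hotel, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    for row in soup.select("tr.hotel"):
        name = row.select_one("td.name").get_text(strip=True)
        if name.lower() == hotel_name.lower():
            return row.select_one("td.price").get_text(strip=True)
    return None

price = price_for("the beaufort", html)
```

In practice you would fetch the page first (e.g. with urllib or requests) and feed the response body to `price_for`.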
...
I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation. I have to go through all the sections and get the people from each section.
How should I go about it? Should I use a crawler to get the pages and search through them using BeautifulSoup?
Or is there any other alternative to get t...
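One alternative to scraping the rendered pages is the MediaWiki API's `list=categorymembers` query, which returns category members as JSON and pages through large categories with `cmcontinue`. A sketch (note that Category:People_by_occupation mostly contains subcategories, so you would recurse into those — that part is left to the caller here):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def category_members_url(category, cmcontinue=None):
    # Build a MediaWiki API query URL for the members of one category.
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return API + "?" + urllib.parse.urlencode(params)

def fetch_members(category):
    # Follow the API's continuation token until the category is exhausted.
    titles, cont = [], None
    while True:
        with urllib.request.urlopen(category_members_url(category, cont)) as resp:
            data = json.load(resp)
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        cont = data.get("continue", {}).get("cmcontinue")
        if not cont:
            return titles
```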
Hi everyone!
I am stuck! I can't get Nutch to crawl in small batches for me. I start it with the bin/nutch crawl command, with parameters -depth 7 and -topN 10000, and it never ends. It only stops when my HDD runs out of space. What I need to do:
Start crawling my seeds, with the possibility to follow outlinks further.
Crawl 20,000 pages, then index them.
C...
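The usual way to cap a Nutch crawl is to skip the all-in-one crawl command and drive the individual steps yourself, so each generate round is bounded by -topN and you decide when to stop and index. A sketch — command names are from Nutch 1.x and are an assumption for your version, so check bin/nutch's usage output:

```
bin/nutch inject crawl/crawldb urls

# Repeat generate/fetch/parse/updatedb until you have enough pages:
bin/nutch generate crawl/crawldb crawl/segments -topN 20000
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
# ...then index the fetched segments when you are done crawling.
```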
I have one domain:
link text
I want to know: does Google crawl Flash, like in the intro of the mentioned website?
thanks
...
I built an App Engine web app, cricket.hover.in. The web app has about 15k URLs
linked in it, but even long after launch, no pages are indexed on Google.
Any link placed on my root site hover.in gets indexed within minutes,
but I placed the same link on the root site's home page long ago, and it's been of no use.
c...
I want to parse HTML content that has something like this:
<div id="sometext">Lorem<br> <b>Ipsun</b></div><span>content</span><div id="block">lorem2</div>
I need to capture just the "Lorem<br> <b>Ipsun</b>" inside the first div. How can I achieve this?
PS: the HTML inside the first div has
multiple lines; it's an article.
Th...
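With BeautifulSoup this is a one-liner once you have the tag: `decode_contents()` returns a tag's inner HTML, child tags included. A minimal sketch using the snippet from the question (note the parser may normalize void tags, e.g. `<br>` to `<br/>`):

```python
from bs4 import BeautifulSoup

html = ('<div id="sometext">Lorem<br> <b>Ipsun</b></div>'
        '<span>content</span><div id="block">lorem2</div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="sometext")
# decode_contents() gives the inner HTML of the div, not just its text.
inner = div.decode_contents()
```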
Is it possible to make JSON data readable by a Google spider?
Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e., the translation from JSON data to the human-displayed page is done inside the user's browser; not my choic...
Given a URL like www.mysampleurl.com, is it possible to crawl through the site and extract links to all the PDFs that might exist?
I've gotten the impression that Python is good for this kind of stuff, but is this feasible to do? How would one go about implementing something like this?
Also, assume that the site does not let you visit som...
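Yes, this is a classic fit for Python. A minimal standard-library sketch: a breadth-first crawl that stays on the start URL's domain, collects every link ending in .pdf, and uses `html.parser` to pull hrefs out of each page (the start URL and page limit are placeholders; a real crawler would also respect robots.txt and handle pages it is not allowed to visit):

```python
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    # Resolve relative hrefs against the page they appeared on.
    parser = LinkParser()
    parser.feed(html)
    return [urllib.parse.urljoin(base_url, href) for href in parser.links]

def crawl_for_pdfs(start_url, max_pages=100):
    domain = urllib.parse.urlparse(start_url).netloc
    seen, pdfs = {start_url}, []
    queue = deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url) as resp:
                html = resp.read().decode(errors="replace")
        except OSError:
            continue  # unreachable or forbidden page: skip it
        for link in extract_links(html, url):
            if link in seen:
                continue
            seen.add(link)
            if link.lower().endswith(".pdf"):
                pdfs.append(link)
            elif urllib.parse.urlparse(link).netloc == domain:
                queue.append(link)
    return pdfs
```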
Does anybody know where I can get a free web crawler that actually works with minimal coding on my part? I've Googled it and can only find really old ones that don't work, or OpenWebSpider, which doesn't seem to work.
Ideally I'd like to store just the web addresses and the links each page contains.
Any suggestions?
thanks
...
Hi,
I'm going to download (for future language-processing purposes) some thousands of webpages. Now I'm thinking about which metadata I should save. I've explored this, but I don't want to neglect anything important.
<title>
<link>
<publish_date>
<date_downloaded>
<source> // to this page
<keyword> // for Solr indexing
<text> // cleaned b...
What are the ways in which web crawlers (both from search engines and elsewhere) could affect site statistics (e.g., when A/B-testing different page variations)? And what are ways to deal with these problems?
For example:
Do a lot of people writing web crawlers delete their cookies and mask their IPs, so that web cr...
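A common first line of defense is to drop hits whose User-Agent looks like a known crawler before computing A/B statistics. A minimal sketch — the substring heuristic below is an assumption (well-behaved bots identify themselves; masked ones need IP- or behavior-based filtering on top):

```python
# Assumed heuristic: most well-behaved crawlers put one of these
# substrings in their User-Agent string.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_probable_crawler(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

hits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 Safari/535.1",
]
# Keep only hits that do not look like crawlers before counting conversions.
human_hits = [ua for ua in hits if not is_probable_crawler(ua)]
```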
Hi,
I have to crawl last.fm for users (a university exercise). I'm new to Python and get the following error:
Traceback (most recent call last):
  File "crawler.py", line 23, in <module>
    for f in user_.get_friends(limit='200'):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", li...
Hi, I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in Firebug's HTML window. The best example is Kayak.com, where you cannot see the resulting DOM when you 'view source' in the browser but can sav...
Hello
I don't know enough about VB.Net yet to use the richer HttpWebRequest class, so I figured I'd use the simpler WebClient class to download web pages asynchronously (to avoid freezing the UI).
However, how can the asynchronous event handler actually return the web page to the calling routine?
Imports System.Net
Public Class Form1...
Say I register a domain and develop it into a complete website. From where, and how, does Googlebot know that the new domain is up? Does it always start with the domain registry?
If it starts with the registry, does that mean that anyone can have complete access to the registry's database? Thanks for any insight.
...
Hi everyone,
Does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean, if I click a certain link, it has different values every time I reload it in a web browser. Right now my web crawler cannot download the contents of these pages. Please advise.
...
I have a website with two domains added. Both domains point to the root of the website. Is it possible to alter robots.txt so that one of the domains doesn't get crawled, while the other still does?
...
I loaded about 15,000 pages, letters A & B of a dictionary, and submitted a text sitemap to Google. I'm using Google search with advertisements as the planned mechanism for people to go through my site. Google's Webmaster Tools accepted the sitemaps as good but then did not index them. My index page has been indexed by Google, and at this point have not ...