web-crawler

What is a good Java web crawler library?

Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search turns up a whole bunch of Java libraries for building a web crawler. Besides that, Nutch is of course a very robust package, but it seems a bit too advanced for my needs. I only need to crawl a handful of websites a week containing a couple of ...

How to "merge" page "\Default.aspx" and "\"?

Our site is developed in ASP.NET. We want to block the Default.aspx page from Google and other search engines. How can we "close" the Default.aspx page so that it is not accessible? Or is there another way to solve the problem so that we don't create duplicate content? ...

Writing a Crawler for Screen Scraping

I want to write a crawler for screen scraping. Specifically, I want to get the price of a particular hotel from a website. At the URL in question, there is a list of hotels and their prices. I want to get the price of the Beaufort. Please advise how to accomplish this. Thanks ...

Mining Groups of people from Wikipedia

I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation. I have to go through all the sections and get the people from each section. How should I go about it? Should I use a crawler to get the pages and search through them using BeautifulSoup? Or is there any other alternative to get t...

Nutch - how to crawl by small patches?

Hi everyone! I am stuck! I can't get Nutch to crawl for me in small patches. I start it with the bin/nutch crawl command with the parameters -depth 7 and -topN 10000, and it never ends. It ends only when my HDD runs out of space. What I need to do: Start to crawl my seeds with the possibility to go further on outlinks. Crawl 20000 pages, then index them. C...

Does Google crawl Flash?

I have one domain, link text. I want to know whether Google crawls Flash, like in the intro of the mentioned website. Thanks ...

Appengine Apps Vs Google bot web crawler

I built an App Engine web app, cricket.hover.in. The web app has about 15k URLs linked in it, but even a long time after my launch, no pages are indexed on Google. Any base link placed on my root site hover.in is indexed within minutes. But I placed the same link on the home page of the root site a long time ago, and it is of no use. c...

Help on preg_match pattern

I want to parse HTML content that looks something like this: <div id="sometext">Lorem<br> <b>Ipsun</b></div><span>content</span><div id="block">lorem2</div> I need to catch just the "Lorem<br> <b>Ipsun</b>" inside the first div. How can I achieve this? PS: the HTML inside the first div spans multiple lines; it's an article. Th...

Is there any way of making JSON data readable by a Google spider?

Is it possible to make JSON data readable by a Google spider? Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e., the translation from JSON data to human-displayed page is done inside the user's browser; not my choic...

Is it possible to extract all PDFs from a site?

Given a URL like www.mysampleurl.com, is it possible to crawl through the site and extract links for all PDFs that might exist? I've gotten the impression that Python is good for this kind of stuff, but is this feasible to do? How would one go about implementing something like this? Also, assume that the site does not let you visit som...
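As a rough illustration of the link-extraction step only, here is a minimal Python sketch using the standard library. The names `PdfLinkParser` and `extract_pdf_links` are hypothetical, and the sketch assumes the page HTML has already been downloaded and that PDF links can be recognised by a `.pdf` path suffix. It does not fetch pages, follow links across the site, or handle pages the site refuses to serve.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Assumption: a PDF is identified by its URL path suffix, not its content type.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def extract_pdf_links(html, base_url):
    """Return the PDF links found in one already-downloaded HTML page."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

A full crawler would call something like this on every page it downloads and queue the remaining same-site links for later visits.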

Web crawler needed

Does anybody know where I can get a free web crawler that actually works with minimal coding by me? I've googled it and can only find really old ones that don't work, or openwebspider, which doesn't seem to work. Ideally I'd like to store just the web addresses and which links each page contains. Any suggestions? Thanks ...

Which metadata should I save when downloading web pages?

Hi, I'm going to download several thousand web pages (for future language-processing purposes). Now I'm wondering which metadata I should save. I have explored this, but I do not want to neglect something important. <title> <link> <publish_date> <date_downloaded> <source> // to this page <keyword> // for Solr indexing <text> // cleaned b...

How do web crawlers affect site statistics?

What are ways in which web crawlers (both from search engines and non-search engines) could affect site statistics (e.g., when doing AB-testing different page variations)? And what are ways to take care of these problems? For example: Do a lot of people writing web crawlers often delete their cookies and mask their IPs, so that web cr...

TypeError: coercing to Unicode: need string or buffer, User found

Hi, I have to crawl last.fm for users (university exercise). I'm new to Python and get the following error: Traceback (most recent call last): File "crawler.py", line 23, in <module> for f in user_.get_friends(limit='200'): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", li...

Web crawler that can interpret JavaScript

Hi, I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in the Firebug HTML window. The best example is Kayak.com, where you cannot see the resulting DOM displayed in the browser when you 'view source' but can sav...

Getting web page after calling DownloadStringAsync()?

Hello, I don't know enough about VB.Net yet to use the richer HttpWebRequest class, so I figured I'd use the simpler WebClient class to download web pages asynchronously (to avoid freezing the UI). However, how can the asynchronous event handler actually return the web page to the calling routine? Imports System.Net Public Class Form1...

Where does Googlebot start crawling?

Say I register a domain and develop it into a complete website. From where, and how, does Googlebot know that the new domain is up? Does it always start with the domain registry? If it starts with the registry, does that mean that anyone can have complete access to the registry's database? Thanks for any insight. ...

WebCrawling Dynamic Links

Hi everyone, does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean, if I click a certain link, it has different values every time I try to reload it in a web browser. My web crawler cannot download the contents of these pages. Please advise. ...

Robots.txt with one site but two domains

I have a website which has two domains added. Both domains point to the root of the website. Is it possible to alter the robots.txt so that one of the domains doesn't get crawled, while the other still does? ...

SEO problem for new dictionary site: Google hasn't indexed content.

I loaded about 15,000 pages, letters A & B of a dictionary, and submitted a text sitemap to Google. I'm using Google's search with advertisements as the planned mechanism for navigating my site. Google's webmaster accepted the sitemaps as good but then did not index them. My index page has been indexed by Google and at this point have not ...