views: 68
answers: 1
What I am trying to do is take a list of URLs and download each URL's content (for indexing). The biggest problem is that if I encounter a link, such as a Facebook event, that simply redirects to the login page, I need to be able to detect and skip that URL. It seems as though the robots.txt file is there for this purpose. I looked into Heritrix, but it seems like far more than I need. Is there a simpler tool that will read robots.txt and scrape the site accordingly?

(Also, I don't need to follow additional links and build up a deep index; I just need to index the individual pages in the list.)

+1  A: 

You could just take the class you are interested in, i.e. http://crawler.archive.org/xref/org/archive/crawler/datamodel/Robotstxt.html

Xavier Combelle
I was kind of hoping for something that does a bit more, all in one package. It is possible that Heritrix is the right tool for the job; maybe I just need a little more direction.
twofivesevenzero
It's hard to answer without knowing what exactly you mean by "index". If you just want to download the content, the URL class and its openConnection method are made for that. See http://download.oracle.com/docs/cd/E17476_01/javase/1.4.2/docs/api/java/net/URL.html#openConnection%28%29
Xavier Combelle
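
For reference, a minimal sketch of the plain java.net approach Xavier describes. The "mybot" user-agent string and the redirect check are my own additions (the redirect check is one way to notice a bounce to a login page), not something from his comment:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PageFetcher {
        // Download the raw content of a single URL, or return null if the
        // server answers with a redirect (e.g. to a login page).
        public static String fetch(String address) throws IOException {
            URL url = new URL(address);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "mybot"); // placeholder bot name
            conn.setInstanceFollowRedirects(false);         // make redirects visible

            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                return null; // redirected -- caller can decide to skip this URL
            }

            StringBuilder body = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            } finally {
                in.close();
            }
            return body.toString();
        }
    }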
I am looking to do a bit more than just download it. I would like to check whether the page is "meaningful" (i.e. it is not behind a paywall or a login screen, etc.), then download the HTML, and finally extract the plain text for indexing. The biggest problem right now is figuring out whether the page is meaningful.
twofivesevenzero
This actually ended up working very well. I created a Robotstxt object and then called getDirectivesFor("<some-bot-type>").allows(url.getPath());
twofivesevenzero
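
For anyone finding this later, here is roughly what that looks like, assuming a Heritrix version whose Robotstxt class (org.archive.crawler.datamodel.Robotstxt, as linked in the answer; the package name differs in Heritrix 3) offers a constructor taking a BufferedReader. The "mybot" name stands in for the "<some-bot-type>" placeholder above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    import org.archive.crawler.datamodel.Robotstxt;

    public class RobotsCheck {
        // Fetch the site's robots.txt and test whether the given page is
        // allowed for the given bot name.
        public static boolean isAllowed(URL pageUrl, String botName) throws IOException {
            URL robotsUrl = new URL(pageUrl.getProtocol() + "://"
                    + pageUrl.getHost() + "/robots.txt");
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(robotsUrl.openConnection().getInputStream()));
            try {
                Robotstxt robots = new Robotstxt(reader);
                return robots.getDirectivesFor(botName).allows(pageUrl.getPath());
            } finally {
                reader.close();
            }
        }
    }

    // Usage: isAllowed(new URL("http://example.com/some/page"), "mybot")

Note that robots.txt only tells you what a site asks crawlers not to fetch; it will not, by itself, reveal whether a page redirects to a login screen, which is why a redirect check like the one sketched earlier is still useful.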