views: 68
answers: 1
What I am trying to do is take a list of URLs and download each URL's content (for indexing). The biggest problem is that if I encounter a link, such as a Facebook event, that simply redirects to the login page, I need to be able to detect and skip that URL. It seems as though the robots.txt file is there for this purpose. I looked into Heritrix, but it seems like far more than I need. Is there a simpler tool that will read robots.txt and scrape the site accordingly?

(Also, I don't need to follow additional links and build up a deep index; I just need to index the individual pages in the list.)

+1  A: 

You could just take the class you are interested in, i.e. http://crawler.archive.org/xref/org/archive/crawler/datamodel/Robotstxt.html

Xavier Combelle
I was kind of hoping for something that does a bit more, all in one package. It is possible that Heritrix is the right tool for the job; maybe I just need a little more direction.
twofivesevenzero
It's hard to answer without knowing what exactly you mean by "index". If you just want to download the content, the URL class and its openConnection method are made for that. See http://download.oracle.com/docs/cd/E17476_01/javase/1.4.2/docs/api/java/net/URL.html#openConnection%28%29
Xavier Combelle
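
For reference, a minimal sketch of the plain java.net approach Xavier describes. The "mybot" user-agent string and the redirect check are my own additions (the redirect check is one way to notice a bounce to a login page), not something from his comment:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PageFetcher {
        // Download the raw content of a single URL, or return null if the
        // server answers with a redirect (e.g. to a login page).
        public static String fetch(String address) throws IOException {
            URL url = new URL(address);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "mybot"); // placeholder bot name
            conn.setInstanceFollowRedirects(false);         // make redirects visible

            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                return null; // redirected -- caller can decide to skip this URL
            }

            StringBuilder body = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            } finally {
                in.close();
            }
            return body.toString();
        }
    }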
I am looking to do a bit more than just download it. I would like to check whether the page is "meaningful" (i.e. it is not behind a paywall or a login screen, etc.), then download the HTML, and finally extract the plain text for indexing. The biggest problem right now is figuring out whether the page is meaningful.
twofivesevenzero
This actually ended up working very well. I created a Robotstxt object and then called getDirectivesFor("<some-bot-type>").allows(url.getPath());
twofivesevenzero
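
For anyone finding this later, here is roughly what that looks like, assuming a Heritrix version whose Robotstxt class (org.archive.crawler.datamodel.Robotstxt, as linked in the answer; the package name differs in Heritrix 3) offers a constructor taking a BufferedReader. The "mybot" name stands in for the "<some-bot-type>" placeholder above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    import org.archive.crawler.datamodel.Robotstxt;

    public class RobotsCheck {
        // Fetch the site's robots.txt and test whether the given page is
        // allowed for the given bot name.
        public static boolean isAllowed(URL pageUrl, String botName) throws IOException {
            URL robotsUrl = new URL(pageUrl.getProtocol() + "://"
                    + pageUrl.getHost() + "/robots.txt");
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(robotsUrl.openConnection().getInputStream()));
            try {
                Robotstxt robots = new Robotstxt(reader);
                return robots.getDirectivesFor(botName).allows(pageUrl.getPath());
            } finally {
                reader.close();
            }
        }
    }

    // Usage: isAllowed(new URL("http://example.com/some/page"), "mybot")

Note that robots.txt only tells you what a site asks crawlers not to fetch; it will not, by itself, reveal whether a page redirects to a login screen, which is why a redirect check like the one sketched earlier is still useful.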