What I am trying to do is take a list of URLs and download each URL's content (for indexing). The biggest problem is that if I encounter a link such as a Facebook event that simply redirects to the login page, I need to be able to detect that and skip the URL. It seems as though the robots.txt file exists for this purpose. I looked into Heritrix, but it seems like far more than I need. Is there a simpler tool that will give me the robots.txt information and scrape sites accordingly?
(Also, I don't need to follow additional links and build up a deep index; I just need to index the individual pages in the list.)
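
In case it clarifies what I'm after, here's a rough sketch of the kind of thing I have in mind, in Python, using the standard library's urllib.robotparser plus requests. The user-agent string, the sample URLs, and the "login in the final URL" heuristic for detecting redirect-to-login pages are just placeholders, not a working solution:

```python
import urllib.robotparser
from urllib.parse import urlparse
import requests

USER_AGENT = "MyIndexer"  # placeholder user-agent

def allowed_by_robots(url):
    """Check the site's robots.txt before fetching the page."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except Exception:
        return True  # robots.txt unreachable; assume allowed
    return rp.can_fetch(USER_AGENT, url)

def fetch_for_indexing(url):
    """Return the page HTML, or None if the URL should be skipped."""
    if not allowed_by_robots(url):
        return None  # disallowed by robots.txt
    resp = requests.get(
        url,
        headers={"User-Agent": USER_AGENT},
        allow_redirects=True,
        timeout=10,
    )
    # Crude heuristic: skip pages that redirected to a login screen
    if resp.history and "login" in resp.url.lower():
        return None
    return resp.text

# Placeholder URL list
urls = ["https://example.com/page1", "https://www.facebook.com/events/123"]
for u in urls:
    html = fetch_for_indexing(u)
    print(f"indexed {u}" if html is not None else f"skipped {u}")
```

Is there an existing tool that already handles this kind of thing, so I don't have to roll my own?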