I'm tinkering with a web tool that, given a URL, will retrieve the text and give the user some statistics on the content.
I'm worried that letting users initiate a GET request from my box to any arbitrary URL on the net may serve as a vector for attacks (e.g. a request to http://undefended.box/broken-sw/admin?do_something_bad).
Are there ways to minimize this risk? Are there any best practices for offering a public URL-retrieval capability?
Some ideas I've thought about:
- honoring robots.txt
- accepting or rejecting only certain URL patterns (a rough sketch of these first two checks follows the list)
- checking a blacklist/whitelist of appropriate sites (if such a thing exists)
- working through a well-known third party's public web proxy, on the assumption that they've already built in these safeguards
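For the first two ideas, here's a rough sketch of the kind of pre-fetch check I have in mind. It's Python, and the user-agent string, the URL pattern, and the decision to treat an unreachable robots.txt as a refusal are just placeholders, not anything I've settled on:

```python
import re
import urllib.robotparser
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
# Placeholder pattern: only plain http(s) URLs
URL_PATTERN = re.compile(r"^https?://[^/]+(/.*)?$")

def is_fetch_allowed(url, user_agent="my-stats-bot"):
    """Return True if the URL passes the scheme/pattern checks and robots.txt allows it."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    if not URL_PATTERN.match(url):
        return False

    # Honor robots.txt on the target host
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # conservative: refuse if robots.txt can't be fetched
    return rp.can_fetch(user_agent, url)
```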
Thanks for your help.
Edit: The tool will evaluate only HTML or plain-text content, without downloading or evaluating linked scripts, images, etc. If the content is HTML, I'll use an HTML parser.
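To make that concrete, this is roughly what I picture the retrieval and parsing step looking like. It assumes Python with the requests library and BeautifulSoup, which are just my working choices, and the size cap and user-agent string are arbitrary placeholders:

```python
import requests
from bs4 import BeautifulSoup

def fetch_text(url, timeout=10, max_bytes=1_000_000):
    """Fetch a URL and return its text, or None if it isn't HTML or plain text."""
    resp = requests.get(url, timeout=timeout, stream=True,
                        headers={"User-Agent": "my-stats-bot"})
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type not in ("text/html", "text/plain"):
        return None

    body = resp.raw.read(max_bytes, decode_content=True)  # cap how much is downloaded
    if content_type == "text/plain":
        return body.decode(resp.encoding or "utf-8", errors="replace")

    # Parse the HTML and keep only the visible text; <script>/<style> contents are
    # dropped, and linked resources (images, scripts, stylesheets) are never fetched.
    soup = BeautifulSoup(body, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```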