views: 92
answers: 1

I'm tinkering with a web tool that, given a URL, will retrieve the text and give the user some statistics on the content.

I'm worried that giving users a way to initiate a GET request from my box to any arbitrary URL on the net may serve as a vector for attacks (e.g. to http://undefended.box/broken-sw/admin?do_something_bad).

Are there ways to minimize this risk? Are there any best practices for offering a public URL-retrieval capability?

Some ideas I've thought about:

  • honoring robots.txt
  • accepting or rejecting only certain URL patterns (see the sketch just after this list)
  • checking against a blacklist or whitelist of appropriate sites (if such a thing exists)
  • working through a well-known third party's public web proxy, on the assumption that they've already built in these safeguards
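
For the URL-pattern idea, this is roughly the kind of check I'm picturing; a minimal sketch assuming Python, where the allowed schemes and the rejected address ranges are only examples rather than a complete defence:

    import ipaddress
    import socket
    from urllib.parse import urlparse

    ALLOWED_SCHEMES = {"http", "https"}

    def is_safe_url(url):
        # Only allow plain web URLs; reject schemes like file:// or ftp://.
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.hostname:
            return False
        try:
            # Resolve the host and refuse private, loopback and link-local
            # addresses, so the fetcher can't be aimed at internal machines.
            for info in socket.getaddrinfo(parsed.hostname, None):
                addr = ipaddress.ip_address(info[4][0])
                if addr.is_private or addr.is_loopback or addr.is_link_local:
                    return False
        except (socket.gaierror, ValueError):
            return False
        return True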

Thanks for your help.

Edit: It'll be evaluating only HTML or text content, without downloading or evaluating linked scripts, images, etc. If HTML, I'll be using an HTML parser.
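
To give an idea of what I mean by evaluating the HTML, something along these lines, assuming Python's standard library parser (a dedicated parsing library would do just as well):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Collects the character data between tags; nothing is executed or fetched.
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

        def text(self):
            return " ".join(self.chunks)

    extractor = TextExtractor()
    extractor.feed("<p>Hello <b>world</b></p>")
    print(extractor.text())  # text content only, no markup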

+2  A: 

Are the statistics going to be only about the text in the document? Are you going to evaluate it using an HTML parser?

If it's only the text that you're going to analyze, that is, without downloading further links, evaluating scripts, etc., then the risk is less severe.

It probably wouldn't hurt to pass each file you download through an anti-virus program. You should also restrict the GETs to certain content types (e.g. don't download binaries; make sure it's some sort of text encoding).
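
As a rough sketch of that check, assuming Python's standard library (the accepted types and the size cap are just illustrative):

    from urllib.request import urlopen

    ACCEPTED_TYPES = ("text/html", "text/plain")
    MAX_BYTES = 1_000_000  # cap the size too, so a huge response can't hurt

    def fetch_text(url):
        with urlopen(url, timeout=10) as response:
            # Check the Content-Type header before reading the body.
            content_type = response.headers.get("Content-Type", "")
            if not content_type.startswith(ACCEPTED_TYPES):
                return None  # skip binaries and anything that isn't text
            # Naive decode for a sketch; honour the declared charset in real code.
            return response.read(MAX_BYTES).decode("utf-8", errors="replace")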

Assaf Lavie