I have control over the HttpServer but not over the ApplicationServer or the Java Applications sitting there but I need to block direct access to certain pages on those applications. Precisely, I don't want users automating access to forms issuing direct GET/POST HTTP requests to the appropriate servlet.

So, I decided to block users based on the value of HTTP_REFERER. After all, if the user is navigating inside the site, it will have an appropriate HTTP_REFERER. Well, that was what I thought.

I implemented a rewrite rule in the .htaccess file that says:

RewriteEngine on 

# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} !^http://mywebaddress(.cl)?/.* [NC]
RewriteRule (servlet1|servlet2)/.+\?.+ - [F]

I expected to forbid access to users that didn't navigate the site but issue direct GET requests to the "servlet1" or "servlet2" servlets using querystrings. But my expectations ended abruptly because the regular expression "(servlet1|servlet2)/.+\?.+" didn't worked at all.

I get really disappointed when I changed that expr to "(servlet1|servlet2)/.+" and worked so well that my users were blocked no matter if they navigated the site or not.

So, my question is: How do I can accomplish this thing of not allowing "robots" with direct access to certain pages if I have no access/privileges/time to modify the application?

+1  A: 

I don't have a solution, but I'm betting that relying on the referrer will never work because user-agents are free to not send it at all or spoof it to something that will let them in.


I'm guessing you're trying to prevent screen scraping?

In my honest opinion it's a tough one to solve and trying to fix by checking the value of HTTP_REFERER is just a sticking plaster. Anyone going to the bother of automating submissions is going to be savvy enough to send the correct referer from their 'automaton'.

You could try rate limiting but without actually modifying the app to force some kind of is-this-a-human validation (a CAPTCHA) at some point then you're going to find this hard to prevent.


You can't tell apart users and malicious scripts by their http request. But you can analyze which users are requesting too many pages in too short a time, and block their ip-addresses.

Michiel de Mare

Javascript is another helpful tool to prevent (or at least delay) screen scraping. Most automated scraping tools don't have a Javascript interpreter, so you can do things like setting hidden fields, etc.

Edit: Something along the lines of this Phil Haack article.

Chris Benard
+1  A: 

Using a referrer is very unreliable as a method of verification. As other people have mentioned, it is easily spoofed. Your best solution is to modify the application (if you can)

You could use a CAPTCHA, or set some sort of cookie or session cookie that keeps track of what page the user last visited (a session would be harder to spoof) and keep track of page view history, and only allow users who have browsed the pages required to get to the page you want to block.

This obviously requires you to have access to the application in question, however it is the most foolproof way (not completely, but "good enough" in my opinion.)

Dan Herbert

If you're trying to prevent search engine bots from accessing certain pages, make sure you're using a properly formatted robots.txt file.

Using HTTP_REFERER is unreliable because it is easily faked.

Another option is to check the user agent string for known bots (this may require code modification).


To make the things a little more clear:

  1. Yes, I know that using HTTP_REFERER is completely unreliable and somewhat childish but I'm pretty sure that the people that learned (from me maybe?) to make automations with Excel VBA will not know how to subvert a HTTP_REFERER within the time span to have the final solution.

  2. I don't have access/privilege to modify the application code. Politics. Do you believe that? So, I must to wait until the rights holder make the changes I requested.

  3. From previous experiences, I know that the requested changes will take two month to get in Production. No, tossing them Agile Methodologies Books in their heads didn't improve anything.

  4. This is an intranet app. So I don't have a lot of youngsters trying to undermine my prestige. But I'm young enough as to try to undermine the prestige of "a very fancy global consultancy services that comes from India" but where, curiously, there are not a single indian working there.

So far, the best answer comes from "Michel de Mare": block users based on their IPs. Well, that I did yesterday. Today I wanted to make something more generic because I have a lot of kangaroo users (jumping from an Ip address to another) because they use VPN or DHCP.

+2  A: 

I'm not sure if I can solve this in one go, but we can go back and forth as necessary.

First, I want to repeat what I think you are saying and make sure I'm clear. You want to disallow requests to servlet1 and servlet2 is the request doesn't have the proper referer and it does have a query string? I'm not sure I understand (servlet1|servlet2)/.+\?.+ because it looks like you are requiring a file under servlet1 and 2. I think maybe you are combining PATH_INFO (before the "?") with a GET query string (after the "?"). It appears that the PATH_INFO part will work but the GET query test will not. I made a quick test on my server using script1.cgi and script2.cgi and the following rules worked to accomplish what you are asking for. They are obviously edited a little to match my environment:

RewriteCond %{HTTP_REFERER} !^http://(www.)?example.(com|org) [NC]
RewriteCond %{QUERY_STRING} ^.+$
RewriteRule ^(script1|script2)\.cgi - [F]

The above caught all wrong-referer requests to script1.cgi and script2.cgi that tried to submit data using a query string. However, you can also submit data using a path_info and by posting data. I used this form to protect against any of the three methods being used with incorrect referer:

RewriteCond %{HTTP_REFERER} !^http://(www.)?example.(com|org) [NC]
RewriteCond %{QUERY_STRING} ^.+$ [OR]
RewriteCond %{PATH_INFO} ^.+$
RewriteRule ^(script1|script2)\.cgi - [F]

Based on the example you were trying to get working, I think this is what you want:

RewriteCond %{HTTP_REFERER} !^http://mywebaddress(.cl)?/.* [NC]
RewriteCond %{QUERY_STRING} ^.+$ [OR]
RewriteCond %{PATH_INFO} ^.+$
RewriteRule (servlet1|servlet2)\b - [F]

Hopefully this at least gets you closer to your goal. Please let us know how it works, I'm interested in your problem.

(BTW, I agree that referer blocking is poor security, but I also understand that relaity forces imperfect and partial solutions sometimes, which you seem to already acknowledge.)


You might be able to use an anti-CSRF token to achieve what you're after. This article explains it in more detail:

Ian Oxley