I'm pretty sure that many people have thought of this, but for some reason I can't find it using Google and StackOverflow search.

I would like to make an invisible link (blacklisted by robots.txt) to a CGI or PHP page that will "trap" malicious bots and spiders. So far, I've tried:

  1. Empty links in the body:

    <a href='/trap'><!-- nothing --></a>
    

    This works quite nicely most of the time, with two minor problems:

    Problems: The link is part of the body of the document. Even though it is practically unclickable with a mouse, some visitors still hit it inadvertently while keyboard-navigating the site with Tab and Enter. Also, if they copy-paste the page into a word processor or e-mail client, the trap link is copied along and is sometimes even clickable (some software doesn't like empty <a> tags and copies the href as the contents of the tag).

  2. Invisible blocks in the body:

    <div style="display:none"><a href='/trap'><!-- nothing --></a></div>
    

    This fixes the problem with keyboard navigation, at least in the browsers I tested. The link is effectively inaccessible from the normal display of the page, while still fully visible to most spider bots with their current level of intelligence.

    Problem: The link is still part of the DOM. If the user copy-pastes the contents of the page, the link reappears.

  3. Inside comment blocks:

    <!-- <a href='/trap'>trap</a> -->
    

    This effectively removes the link from the DOM of the page. Well, technically, the comment is still part of the DOM, but it achieves the desired effect that compliant user-agents won't generate the A element, so it is not an actual link.

    Problem: Most spider bots nowadays are smart enough to parse (X)HTML and ignore comments. I've personally seen bots that use Internet Explorer COM/ActiveX objects to parse the (X)HTML and extract all links through XPath or Javascript. These types of bots are not tricked into following the trap hyperlink.

I was using method #3 until last night, when I was hit by a swarm of bots that seem to be very selective about which links they follow. Now I'm back to method #2, but I'm still looking for a more effective way.

Any suggestions, or another different solution that I missed?
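For reference, the robots.txt rule that blacklists the trap (assuming it lives at /trap, as in the examples above) is simply:

```
User-agent: *
Disallow: /trap
```

Well-behaved crawlers honor the Disallow and never request the page; only bots that ignore robots.txt end up in the trap, which is exactly what makes it a useful filter.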

+6  A: 

Add it like you said:

<a id="trap" href='/trap'><!-- nothing --></a>

And then remove it with javascript/jQuery:

$(function () {
  // wait until the DOM is ready so the element exists, then remove it
  $('#trap').remove();
});

Spam bots won't execute the JavaScript, so they still see the link, while almost any browser will remove the element, making it impossible to reach by tabbing.

Edit: The easiest non-jQuery way would be:

<div id="trapParent"><a id="trap" href='/trap'><!-- nothing --></a></div>

And then remove it with javascript:

// look up the wrapper and the trap link by id, then detach the link
var parent = document.getElementById('trapParent');
var child = document.getElementById('trap');
parent.removeChild(child);
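If that script runs before the parser reaches the trap markup (e.g. from the document head), getElementById returns null and removeChild throws. A small sketch that guards against this; the removeTrap helper name and the null check are my additions, not part of the answer above, and this variant removes the wrapper div as well so no empty element is left behind:

```javascript
// Remove the trap wrapper once the document is available.
// "removeTrap" is a hypothetical helper name; the ids match the
// snippet above. Passing the document in makes the helper testable.
function removeTrap(doc) {
  var parent = doc.getElementById('trapParent');
  // guard: the page may not contain the trap at all, or the script
  // may run before the markup has been parsed
  if (parent && parent.parentNode) {
    parent.parentNode.removeChild(parent);
  }
}

// in the page, defer until the DOM is parsed:
// document.addEventListener('DOMContentLoaded', function () {
//   removeTrap(document);
// });
```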
Sander Rijken
What if some clients don't have JavaScript enabled?
Maxim Zaslavsky
Then they don't see the link, but they can still tab to it. It's not a 100% fix, but I don't think many clients both navigate by tabbing to an invisible link and have JavaScript disabled.
Sander Rijken
This seems to be a great solution! I'm not really worried about the JavaScript dependency, since it affects only the intersection of two minorities (those who discover the link by copy-pasting or tabbing, and those who have scripts disabled). Could you expand your answer to include a solution that doesn't depend on jQuery? I think it should involve document.getElementById() and node.parentNode.removeChild(), but my experience with JavaScript is quite limited.
Juliano
+1  A: 

This solution seems to work well for me; luckily, I had bookmarked it. I hope it helps you as well.

You can create a hidden link like this and put it at the very top left of your page. To prevent regular users from accessing it too easily, you can use CSS to lay a logo image over this image.

<a href="/bottrap.php"><img src="images/pixel.gif" border="0" alt=" " width="1" height="1"></a> 

If you are interested in setting up the blacklisting of the bots, refer to this link for a detailed explanation of how to do it:

http://www.webmasterworld.com/apache/3202976.htm

cmptrwhz
Correct me if I'm wrong, but does that discussion contain an answer to how to add a link to the trap? If so, please mention it in your answer for easier discovery (and leave the link for those that want to know the specifics)
Sander Rijken
The trap itself is not the problem. I use a daemon that receives messages from the trap scripts and adds the IP address to an iptables "recent" match rule that bans the IP for 1 hour. The problem is adding the link to pages in a clean way. Your solution is no better than the methods I'm using: not only is the link accessible (and clickable), but if the user copy-pastes the text, it copies the link _and_ the image. Sometimes the image is embedded, sometimes it becomes broken, and sometimes the application adds the link base to the original link.
Juliano
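The ban described in the last comment could look roughly like this; the list name "trap", the one-hour timeout, and the example address are assumptions, not the author's actual configuration:

```
# Drop packets from any address seen in the "trap" list within the last hour
iptables -A INPUT -m recent --name trap --rcheck --seconds 3600 -j DROP

# The daemon bans an offender by writing its address to the kernel's list
# (the file lives under /proc/net/xt_recent/ on recent kernels,
# /proc/net/ipt_recent/ on older ones)
echo +203.0.113.7 > /proc/net/xt_recent/trap
```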