I'm pretty sure that many people have thought of this, but for some reason I can't find it using Google and StackOverflow search.
I would like to make an invisible link (blacklisted by robots.txt) to a CGI or PHP page that will "trap" malicious bots and spiders. So far, I've tried:
Empty links in the body:
<a href='/trap'><!-- nothing --></a>
This works quite nicely most of the time, with two minor problems:
Problem: The link is part of the body of the document. Even though it is pretty much unclickable with a mouse, some visitors still inadvertently hit it while keyboard-navigating the site with Tab and Enter. Also, if they copy-paste the page into a word processor or e-mail software, for example, the trap link is copied along and sometimes even clickable (some software don't like empty
<a>
tags and copy the href as the contents of the tag).Invisible blocks in the body:
<div style="display:none"><a href='/trap'><!-- nothing --></a></div>
This fixes the problem with keyboard navigation, at least in the browsers I tested. The link is effectively inaccessible from the normal display of the page, while still fully visible to most spider bots with their current level of intelligence.
Problem: The link is still part of the DOM. If the user copy-paste the contents of the page, it reappears.
Inside comment blocks:
<!-- <a href='/trap'>trap</a> -->
This effectively removes the link from the DOM of the page. Well, technically, the comment is still part of the DOM, but it achieves the desired effect that compliant user-agents won't generate the A element, so it is not an actual link.
Problem: Most spider bots nowadays are smart enough to parse (X)HTML and ignore comments. I've personally seen bots that use Internet Explorer COM/ActiveX objects to parse the (X)HTML and extract all links through XPath or Javascript. These types of bots are not tricked into following the trap hyperlink.
I was using method #3 until last night, when I was hit by a swarm of bots that seem to be really selective on which links they follow. Now I'm back to method #2, but I'm still looking for a more effective way.
Any suggestions, or another different solution that I missed?