I know that user agents are one indicator, but that's easy to spoof. What other reliable indicators are there that a visitor is really a bot? Inconsistent headers? Whether images/javascript are requested? Thanks!
"Whether images/javascript are requested?" I would go for this one, however Google and others request images and javascript files nowadays.
How about request speed? Bots page through your content a lot faster than humans do.
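A rough sketch of that check in Python; the half-second threshold and the in-memory tracker are assumptions you'd tune for your own site:

```python
import time
from collections import defaultdict, deque

# Keep the last 10 request timestamps per IP (hypothetical in-memory store).
recent_hits = defaultdict(lambda: deque(maxlen=10))

MIN_HUMAN_INTERVAL = 0.5  # seconds between page views; assumption

def looks_like_bot(ip):
    """Flag an IP whose average gap between page requests is faster
    than a plausible human reading speed."""
    hits = recent_hits[ip]
    hits.append(time.time())
    if len(hits) < 5:
        return False  # not enough samples yet
    avg_gap = (hits[-1] - hits[0]) / (len(hits) - 1)
    return avg_gap < MIN_HUMAN_INTERVAL
```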
CVSTrac uses a honeypot page to accomplish this: a page linked somewhere on the site that crawlers will follow but humans generally never click. CVSTrac goes one step further by letting a visitor who lands there prove they are human.
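The idea fits in a few lines of any web framework. Here's a minimal sketch in Python with Flask (not CVSTrac's actual implementation; the trap URL and the in-memory flag set are hypothetical):

```python
from flask import Flask, request

app = Flask(__name__)
flagged_ips = set()  # hypothetical store; persist this in practice

# Link "/trap-page" somewhere humans won't click (e.g. an invisible
# anchor); crawlers that follow every link will land on it.
@app.route("/trap-page")
def honeypot():
    flagged_ips.add(request.remote_addr)
    return "Nothing to see here."

@app.before_request
def challenge_flagged():
    # Flagged visitors get a challenge instead of content; like CVSTrac,
    # you could let them prove they're human here and unflag them.
    if request.remote_addr in flagged_ips and request.path != "/trap-page":
        return "Please confirm you are human.", 403
```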
There are four things we look for (a rough scoring sketch follows the list):
The user-agent string. It is very easy to fake, but crawlers will often use their own distinctive user-agent string anyway.
The speed of page access. If a client requests more than one page every half second or so, that's usually a good indication.
Whether they request just the HTML or the entire page. Some crawlers only ask for the HTML structure and skip images, stylesheets, and scripts. This is usually a good tip-off.
The incoming URL. Bot traffic often arrives with no referrer at all, or from a URL that doesn't match your site's link structure.
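A rough scoring sketch combining the four signals above; every name, weight, and threshold here is an assumption to illustrate the idea, not a tested ruleset:

```python
# Hypothetical combination of the four signals; tune against real traffic.
KNOWN_BOT_AGENTS = ("googlebot", "bingbot", "crawler", "spider")

def bot_score(user_agent, seconds_since_last_hit, fetched_assets, referrer):
    """Return a suspicion score; e.g. treat >= 3 as likely bot."""
    score = 0
    if any(name in user_agent.lower() for name in KNOWN_BOT_AGENTS):
        score += 2  # self-identified crawler
    if seconds_since_last_hit < 0.5:
        score += 1  # paging faster than a human reads
    if not fetched_assets:
        score += 1  # requested HTML only, no images/CSS/JS
    if not referrer:
        score += 1  # no incoming URL
    return score
```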
Take a look at Bad Behavior, a library that employs a wide array of bot-detection techniques.
A reverse captcha of sorts can help as well; you could create a text input field with display: none; in its style attribute (or your stylesheet). If it's posted to, chances are you're dealing with a bot.
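A minimal sketch of that idea in Python with Flask; the decoy field name "website" is an arbitrary choice:

```python
from flask import Flask, request

app = Flask(__name__)

# The "website" field is invisible to humans; form-filling bots that
# populate every input will reveal themselves by filling it in.
FORM = """
<form method="post" action="/submit">
  <input name="email" type="text">
  <input name="website" type="text" style="display: none;">
  <button type="submit">Send</button>
</form>
"""

@app.route("/contact")
def contact():
    return FORM

@app.route("/submit", methods=["POST"])
def submit():
    # A human never sees the decoy field, so any value means a bot.
    if request.form.get("website"):
        return "Rejected.", 400
    return "Thanks!"
```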
Edit: This was actually something I'd seen aggregated in my RSS reader; if I can find the source, I'll link a good example.