I went to a PHP job interview and was asked to implement a piece of code to detect whether visitors are bots crawling through the website to steal content.

So I implemented a few lines of code to detect whether the site is being refreshed or visited too quickly or too often, using a session variable to store the timestamp of the last visit.

I was told that session variables can be manipulated by cookies etc., so I am wondering if there is an application-level variable that I can use to store the timestamp information against visitor IPs, e.g. $_SERVER['REMOTE_ADDR']?

I know that I can write the data to a file, but that's not very good for a high-traffic website.

Regards

James

+1  A: 

In a word, no. Your options are cookies, session vars (aka server-side cookies) and storage (file/db).

adam
session vars/server-side cookies?? How is the user identified then? (so that the server knows which session to load up). That's the whole point of cookies being client-side.
Mark
+3  A: 

I got told that session variables can be manipulated by cookies etc,

Just to be clear, clients can't edit session variables to their liking. They can, however, delete or change the PHPSESSID cookie, which gives them a different session. Global variables (such as $_SERVER) are not persistent, so any changes you make to them will not survive to the next page load.

The best way to go about detecting crawlers is to store the IP address, user-agent and timestamp of all page loads in a database. The overhead is minuscule.
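A minimal sketch of that logging idea, assuming PDO with SQLite. The table name `page_loads`, the helper function names and the in-memory DSN are all made up for illustration; a real site would point the DSN at a persistent database.

```php
<?php
// Hypothetical schema and helper names; not taken from the answer above.

function createPageLoadTable(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS page_loads (
            ip TEXT NOT NULL,
            user_agent TEXT NOT NULL,
            loaded_at INTEGER NOT NULL
        )'
    );
    // Index so the per-IP lookup below does not scan the whole table.
    $db->exec('CREATE INDEX IF NOT EXISTS idx_page_loads_ip_time ON page_loads (ip, loaded_at)');
}

function logPageLoad(PDO $db, string $ip, string $userAgent, int $now): void
{
    $stmt = $db->prepare('INSERT INTO page_loads (ip, user_agent, loaded_at) VALUES (?, ?, ?)');
    $stmt->execute([$ip, $userAgent, $now]);
}

// Number of page loads from $ip during the last $windowSeconds seconds.
function countRecentLoads(PDO $db, string $ip, int $now, int $windowSeconds): int
{
    $stmt = $db->prepare('SELECT COUNT(*) FROM page_loads WHERE ip = ? AND loaded_at > ?');
    $stmt->execute([$ip, $now - $windowSeconds]);
    return (int) $stmt->fetchColumn();
}

// Demo only: an in-memory database vanishes when the script ends.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createPageLoadTable($db);
logPageLoad(
    $db,
    $_SERVER['REMOTE_ADDR'] ?? '127.0.0.1',
    $_SERVER['HTTP_USER_AGENT'] ?? 'unknown',
    time()
);
```

A page could then compare `countRecentLoads(...)` against whatever limit suits the site before deciding how to respond.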

Johannes Gorset
I *believe* the session can be hacked if the session id is carried in the url
adam
But clients can refuse cookies, thwarting the use of cookie-based sessions. However, IP-based sessions should be slightly more reliable.
Eric J.
Clients *choose* whether or not to send cookies and what cookies to send. That's the point.
cletus
They can, however, delete or refuse the session ID.
ceejayoz
@adam: No. Carrying the session identifier in the URL is no less secure than using cookies.
Johannes Gorset
@cletus: You're right. The solution I'm proposing has nothing to do with cookies.
Johannes Gorset
The session can only be hacked if they discover someone else's SESSID by packet sniffing or something. Or, if it's in the URL, I suppose they could peer over someone's shoulder and scribble it down. Of course, you can just as easily modify a cookie with your stolen SESSID. Why aren't IPs used in conjunction with SESSIDs for added security?
Mark
Ah, my mistake. Just read up on it, I was thinking of impersonation issues. Either way, I like to think I can bring a debate to the table :D
adam
@adam: Well, you could be impersonating someone if you stumble upon another user's session identifier. ;-)
Johannes Gorset
@Mark: Some ISPs change your IP address seemingly at random, so I presume it's to maintain compatibility for users who fall into this category.
Johannes Gorset
Yes, I had thought of using IPs, but my primary problem at the time was deciding how to store the timestamp and data, given that I only had 15 minutes to implement it on their laptop. I didn't use a file because I thought I would get criticized for the I/O overhead, so I chose a session variable just to demonstrate the idea.
James Lin
@James: You should know that session variables are usually stored in files. ;-) At the end of the day, if you want to save something, you *have* to write it to a file in one way or another.
Johannes Gorset
A: 

Bots can ignore saving the cookie data (as in not passing the session ID back). The best option would be to use some sort of external DB or storage system, like a C++ socket program that simply stores the IP and compares it to recent connections.

St. John Johnson
+1  A: 

Your best bet for this might be after-the-fact analysis of the logs. It won't stop content theft on-the-fly, but it'll be much easier to find abuse patterns and block those IPs from future accesses.
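A rough sketch of that kind of log analysis, assuming each log line starts with the client IP followed by a space (as in the common and combined access log formats). The function names and the threshold are hypothetical.

```php
<?php
// Count requests per IP from an iterable of access-log lines.
// Assumption: the client IP is the first space-separated field on each line.
function countRequestsPerIp(iterable $logLines): array
{
    $counts = [];
    foreach ($logLines as $line) {
        $ip = strtok(trim($line), ' ');
        if ($ip === false || $ip === '') {
            continue; // skip blank lines
        }
        $counts[$ip] = ($counts[$ip] ?? 0) + 1;
    }
    arsort($counts); // busiest IPs first
    return $counts;
}

// IPs whose request count exceeds the (hypothetical) threshold.
function ipsOverThreshold(array $counts, int $threshold): array
{
    return array_keys(array_filter($counts, fn (int $n): bool => $n > $threshold));
}
```

Feeding it something like `file($path)` for a day's log gives a ranked list to eyeball before blocking anything; raw counts alone will also flag busy proxies and legitimate crawlers.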

ceejayoz
+1  A: 

You would need to store the IP and timestamps server-side. It's unlikely that a bot would send cookies, and even a URL-based session is not reliable.

The overhead of a file should not be too much, unless you are just doing flat-file logging which will kill you. You can use SQLite or similar, perhaps stored on a memory based filesystem for a small speed boost. Or you could go with something like memcached. If you need to persist the data, use MySQL. The overhead of a full-blown database is practically nothing compared with the time it takes PHP to do pretty much anything.
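Whichever store you pick, the check itself can be a simple per-IP counter that expires after a time window. Here is a sketch with the storage hidden behind a hypothetical `CounterStore` interface; the array-backed class is only an in-process stand-in for demonstration, and a memcached- or database-backed implementation is what would actually persist between requests.

```php
<?php
// Hypothetical interface: anything that can bump a counter that expires.
interface CounterStore
{
    // Increments the counter for $key and returns the new value.
    // A counter whose window has passed starts again from zero.
    public function increment(string $key, int $ttlSeconds, int $now): int;
}

// Demo-only implementation; its contents are lost when the script ends.
final class ArrayCounterStore implements CounterStore
{
    private array $entries = [];

    public function increment(string $key, int $ttlSeconds, int $now): int
    {
        $entry = $this->entries[$key] ?? null;
        if ($entry === null || $entry['expires'] <= $now) {
            $entry = ['count' => 0, 'expires' => $now + $ttlSeconds];
        }
        $entry['count']++;
        $this->entries[$key] = $entry;
        return $entry['count'];
    }
}

// True while the IP stays within $maxPerWindow requests per window.
// The default limits are arbitrary placeholders.
function isRequestAllowed(
    CounterStore $store,
    string $ip,
    int $now,
    int $maxPerWindow = 30,
    int $windowSeconds = 60
): bool {
    return $store->increment('hits:' . $ip, $windowSeconds, $now) <= $maxPerWindow;
}
```

Keeping the store behind an interface means the rate check doesn't care whether the counts live in memcached, SQLite or MySQL.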

If you really want to do something like this with sessions, display a user agreement page unless there is a defined "I Agree" variable in the session. That way, if a bot doesn't send a valid session back, all it gets is the user agreement. If it does, then you can track it with session variables.
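A bare-bones sketch of that gate. The session key `agreed` and the page name `agreement` are invented for the example.

```php
<?php
// Decide what to serve: the requested page only if this session has agreed.
// 'agreed' and 'agreement' are hypothetical names.
function pageToServe(array $session, string $requestedPage): string
{
    return ($session['agreed'] ?? false) === true ? $requestedPage : 'agreement';
}

// Call this when the visitor submits the "I Agree" form.
function recordAgreement(array &$session): void
{
    $session['agreed'] = true;
}
```

In a real page you would call `session_start()` first and pass `$_SESSION` to both functions; a client that never sends the session ID back keeps getting the agreement page.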

Bear in mind that the session-based solution is not necessary since you don't need to remember client state between requests, and that sessions will incur as much, if not more, overhead than most custom alternatives.

Regarding the statement that session variables can be manipulated by cookies, it's not entirely true. However, if you're silly enough to leave register_globals on and you ask for a global variable, I wouldn't like to hazard a guess as to whether it came from a session, a cookie, a query string, the environment, or was previously undefined. This is all moot if you explicitly access through $_SESSION of course.

Duncan
A: 

Don't expect to defeat them by refresh times alone. I did something very similar to combat contact form spam and some bots wait as long as people before taking the next action.

I'd look more at IP addresses that load just the HTML document and skip things like the favicon, CSS stylesheets, etc. If you set CSS files to be parsed as PHP, you can put some logic in there to record that the IP looks legit. Just be careful about caching.
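Something along these lines could pick out those IPs from logged requests. It assumes each hit is an `[ip, path]` pair; the function names, the extension list and the minimum page count are all placeholders.

```php
<?php
// True for paths that look like static assets (extension list is a guess).
function isAssetPath(string $path): bool
{
    $path = (string) parse_url($path, PHP_URL_PATH); // drop any query string
    return (bool) preg_match('/\.(css|js|png|jpe?g|gif|ico)$/i', $path);
}

// IPs with at least $minPageLoads page requests and no asset requests at all.
function ipsLoadingOnlyPages(array $hits, int $minPageLoads = 5): array
{
    $pages = [];
    $assets = [];
    foreach ($hits as [$ip, $path]) {
        if (isAssetPath($path)) {
            $assets[$ip] = true;
        } else {
            $pages[$ip] = ($pages[$ip] ?? 0) + 1;
        }
    }

    $suspects = [];
    foreach ($pages as $ip => $count) {
        if ($count >= $minPageLoads && !isset($assets[$ip])) {
            $suspects[] = $ip;
        }
    }
    return $suspects;
}
```

Because of the caching caveat, a returning browser with everything cached can look just like this, so treat the result as a list of suspects rather than proof.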

Also, are you taking steps to make sure you don't lock out legitimate bots like Googlebot?
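One way to whitelist Googlebot without trusting the user-agent string is a reverse DNS lookup followed by a forward lookup, which is the approach Google describes for verifying its crawler. A sketch, with the lookups injectable so it can be tried without network access; the function name and the hostname pattern are my own assumptions, and `gethostbynamel()` only returns IPv4 addresses.

```php
<?php
// Reverse-then-forward DNS check. By default it uses PHP's gethostbyaddr()
// and gethostbynamel(); pass your own callables to stub the lookups.
function isVerifiedGooglebot(
    string $ip,
    ?callable $reverseLookup = null,
    ?callable $forwardLookup = null
): bool {
    $reverseLookup = $reverseLookup ?? 'gethostbyaddr';
    $forwardLookup = $forwardLookup ?? 'gethostbynamel';

    $host = $reverseLookup($ip);
    // gethostbyaddr() returns the IP unchanged when the lookup fails.
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    // Assumed pattern: hostname must end in .googlebot.com or .google.com.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // The hostname must resolve back to the same IP, or the PTR record is spoofed.
    $addresses = $forwardLookup($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

DNS lookups are slow, so in practice you'd cache the verdict per IP rather than run this on every request.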

Syntax Error