I have a large database of links, all sorted in specific ways and attached to other information that is valuable (to some people).

Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If that count is greater than x, it redirects you to a captcha page.

That all works fine and dandy, but the site has been getting really popular (and has also been getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize how often I have to hit up PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.
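A minimal sketch of what I mean, assuming a hypothetical count.php endpoint whose only job is to bump the counter:

```html
<!-- Plain link; the click handler bumps the view count in the background. -->
<a href="http://example.com/real-target" class="tracked" data-id="123">some link</a>

<script>
// Fire-and-forget hit to the hypothetical count.php; the browser follows the
// link immediately, so a slow PHP response never delays the user.
document.addEventListener('click', function (e) {
  var a = e.target.closest('a.tracked');
  if (!a) return;
  new Image().src = '/count.php?id=' + encodeURIComponent(a.getAttribute('data-id'));
});
</script>
```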

The problem is that this makes the site REALLY scrapable. Is there anything I can do to prevent that while still not relying on PHP to do the check before spitting out the link?

A: 

Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.
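For example, roughly this kind of cleanup job (the request_log table and column names here are just placeholders):

```php
<?php
// Hypothetical log table: request_log(ip VARCHAR(45), requested_at DATETIME).
// An index such as (ip, requested_at) keeps the "requests from this IP in the
// last 5 minutes" check from scanning the whole table.
$db = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');

// Run this from cron to keep the throttling table small.
$db->exec("DELETE FROM request_log WHERE requested_at < NOW() - INTERVAL 1 HOUR");
```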

If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.

Byron Whitlock
MySQL isn't the problem. The MySQL server is underused, since everything is memcached and optimized as hell. PHP connections are what's doing it. If there were no attack, the server could handle it no problem.
Yegor
Have you benchmarked the PHP code to determine what code is the source of the bottleneck?
sutch
@Yegor - So basically the real problem is that the server cannot handle a DDoS attack. You should make that the root of your question.
Joel L
A: 

Most scrapers just analyze static HTML, so encode your links and then decode them dynamically in the client's web browser with JavaScript.

Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.
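For example, something along these lines (the data attribute name and the Base64 scheme are arbitrary choices):

```html
<!-- The server emits links with the real URL Base64-encoded in a data attribute. -->
<a href="#" data-u="aHR0cDovL2V4YW1wbGUuY29tL3NvbWUtcGFnZQ==">some link</a>

<script>
// Decode the payload in the browser, so the static HTML never contains the URL.
document.addEventListener('DOMContentLoaded', function () {
  var links = document.querySelectorAll('a[data-u]');
  for (var i = 0; i < links.length; i++) {
    links[i].href = atob(links[i].getAttribute('data-u'));
  }
});
</script>
```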

Plumo
It would be more helpful to leave a comment than just a downvote...
Plumo
A: 

Nothing you do on the client side can be protected, so why not just use AJAX?

Have an onclick event that calls an AJAX function, which returns just the link and fills it into a DIV on your page. Because the request and the response are small, it will be fast enough for what you need. Just make sure the function you call checks the timestamp; it is easy to write a script that calls that function many times to steal your links.

You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast, and the client doesn't even know it's not pure JS.
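Roughly like this (getlink.php here is a made-up endpoint that does the timestamp check and returns only the URL):

```html
<a href="#" class="get-link" data-id="123">show link</a>
<div id="link-target"></div>

<script>
// On click, fetch just the link from the server and drop it into the DIV.
// getlink.php is still expected to do the per-IP timestamp check.
$(document).on('click', 'a.get-link', function (e) {
  e.preventDefault();
  $.get('/getlink.php', { id: $(this).data('id') }, function (url) {
    $('#link-target').html('<a href="' + url + '">' + url + '</a>');
  });
});
</script>
```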

Radu
It still has to run a select query, which defeats the purpose entirely.
Yegor
My bad, I didn't see the part where you said that you do not want to use PHP :) If you use just client-side scripting, there is no way to prevent scrapers. You can minify your JS, encode it, and give variables and functions meaningless names (function a(), var a_0, etc.); this will stop 90% of scrapers (beginners) but will not stop the advanced ones :( If you pull the information out of MySQL on the first load of the page, you could save it to the SESSION and then use AJAX just to read from the session (still using PHP, but not querying the database again).
Radu
Querying the DB isn't a problem; it's under no stress. Using PHP is what I want to keep to a minimum. I run the update query with AJAX in the background, so even if it hangs for 2-3 seconds, it's no big deal. But when it hangs for 2-3 seconds before loading the link, it is a big deal. A JS solution seems like the only way to go...
Yegor
+1  A: 

You could do the IP throttling at the web server level. Maybe a module exists for your web server; as an example, with Apache you can write your own RewriteMap and have it consult a daemon program, so you can do more complex things. Have the daemon program query an in-memory database. It will be fast.
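For example, something like this on the Apache side (the map name and daemon path are made up; the daemon reads one IP per line on stdin, keeps its counters in memory, and prints one answer, such as ok or captcha, per line):

```apache
# vhost sketch: mod_rewrite consults an external program on each request.
RewriteEngine On
RewriteMap throttle "prg:/usr/local/bin/throttle-daemon"

# Send over-limit clients to the captcha page instead of the link handler.
RewriteCond ${throttle:%{REMOTE_ADDR}} =captcha
RewriteRule ^/link\.php$ /captcha.php [L]
```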

chris
+1  A: 

It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.

Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
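A minimal sketch of that with the memcached extension (the key scheme, 5-minute window, and limit of 50 are arbitrary placeholders):

```php
<?php
// Per-IP counter kept in memcached instead of logging every request to MySQL.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key   = 'hits:' . $_SERVER['REMOTE_ADDR'];
$limit = 50;

// add() only succeeds if the key doesn't exist yet and sets the 5-minute expiry;
// increment() then bumps the counter atomically.
$mc->add($key, 0, 300);
$hits = $mc->increment($key);

if ($hits !== false && $hits > $limit) {
    header('Location: /captcha.php');
    exit;
}
// ...otherwise continue with the normal link lookup.
```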

As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).

sutch
It's not. The DB server is under no stress whatsoever.
Yegor
Have you benchmarked the PHP code to determine what code is the source of the bottleneck?
sutch
To add to what @sutch is saying, the problem is probably the simultaneous inserting into and reading from the database. If you don't control your Apache and can't set up IP throttling as in @chris's answer, you should at least not read the IP list at request time. Inserting is fine; then just have a script run every 5 minutes that reads through the table, builds a list of all banned IP addresses, and writes it to a plain-text file. Then, during each request, just open that file and check whether the IP is on the list. (Best would be to keep that list in memory instead of a file.)
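Roughly this kind of cron script (the table, threshold, and file path are placeholders):

```php
<?php
// Run every 5 minutes: rebuild a plain-text list of over-limit IPs.
$db  = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');
$ips = $db->query(
    "SELECT ip FROM request_log
     WHERE requested_at > NOW() - INTERVAL 5 MINUTE
     GROUP BY ip HAVING COUNT(*) > 50"
)->fetchAll(PDO::FETCH_COLUMN);

file_put_contents('/tmp/banned_ips.txt', implode("\n", $ips), LOCK_EX);
```

The per-request check then becomes just a file read, e.g. in_array($_SERVER['REMOTE_ADDR'], file('/tmp/banned_ips.txt', FILE_IGNORE_NEW_LINES)), with no SELECT at request time.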
arnorhs