I am building a web application in Ruby on Rails that is built around a large body of data. The application offers powerful navigation and intersection of the data, as well as a community model for adding more data. In that respect one could compare it with StackOverflow.com: a big bunch of data, structured in a fairly simple way.

I intend to offer the content under a Creative Commons license, but if the site takes off, I need to discourage copycats. My biggest fear is screen-scraping scripts that not only leech away the raw data but also cause huge usage peaks on my servers.

I wonder if Ruby on Rails offers any way to throttle (obviously automated) requests, e.g. to slow down their responses for the benefit of regular users. Perhaps this requires Apache or Phusion Passenger settings?

EDIT: My goal is not to recognize user types, but to reduce responsiveness for overly active users, e.g. by capping the number of requests handled per IP address per unit of time.

A: 

I believe all you can do is put up hoops for the user to jump through. Ultimately there is no foolproof way to distinguish a regular user from a bot.

Martin Murphy
+3  A: 

My suggestion would be to limit any easy iterative navigation of your website, which is the primary way I have seen harvesting programs work. Simply encrypting the id numbers used as GET variables would make strip-mining your info more difficult. You can only try to make getting your information onerous. You won't be able to prevent it completely.
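As a rough illustration of the id-encryption idea, here is a minimal sketch using Ruby's standard OpenSSL bindings. The `IdObfuscator` module, its method names, and the placeholder secret are all hypothetical and not part of Rails. AES in ECB mode is chosen only so that the same id always maps to the same URL token; treat it as a way to hide sequential ids, not as a general-purpose encryption scheme.

```ruby
require "openssl"
require "digest"

# Hypothetical helper: turns sequential integer ids into opaque URL tokens.
module IdObfuscator
  # Assumption: a real application would load this secret from configuration.
  KEY = Digest::SHA256.digest("replace-with-a-real-secret")

  # Encrypts the id and returns it as a hex string that is safe to put in a URL.
  def self.encode(id)
    cipher = OpenSSL::Cipher.new("aes-256-ecb")
    cipher.encrypt
    cipher.key = KEY
    (cipher.update(id.to_s) + cipher.final).unpack1("H*")
  end

  # Returns the original integer id, or nil if the token is malformed or tampered with.
  def self.decode(token)
    cipher = OpenSSL::Cipher.new("aes-256-ecb")
    cipher.decrypt
    cipher.key = KEY
    Integer(cipher.update([token.to_s].pack("H*")) + cipher.final)
  rescue OpenSSL::Cipher::CipherError, ArgumentError
    nil
  end
end

token = IdObfuscator.encode(42)
puts token                       # 32 hex characters instead of "42"
puts IdObfuscator.decode(token)  # the original id
```

A controller could then look records up by `IdObfuscator.decode(params[:id])` and return a 404 when the result is `nil`, so a scraper can no longer walk the data by counting ids upward.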

Mobius
This is not a solution for the throttling problem, but it is indeed a smart trick! Thanks!
I'm sorry I didn't answer the question you asked! I'm way, way too dumb with RoR to even attempt to offer a programmatic suggestion.
Mobius
+1  A: 

You could present a captcha to the "overly active users", just like SO does when you edit too fast. That should effectively hinder automated, spider-like scraping.
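Deciding who counts as "overly active" could look something like the sketch below, which keeps a sliding window of request timestamps per client. The `ActivityTracker` class, its thresholds, and the choice of client identifier are all assumptions made for illustration. State lives in one process's memory and is not thread-safe, so a real deployment would need a shared store.

```ruby
# Hypothetical tracker: flags a client for a captcha once it makes more than
# `max_requests` requests within the trailing `window` seconds.
class ActivityTracker
  def initialize(max_requests: 30, window: 60)
    @max_requests = max_requests
    @window = window
    @timestamps = Hash.new { |hash, key| hash[key] = [] }
  end

  # Records one request and returns true if the client should be challenged.
  def challenge?(client_id, now = Time.now.to_f)
    recent = @timestamps[client_id]
    recent << now
    # Drop timestamps that have fallen out of the window. The entry just
    # added is always inside the window, so the array never becomes empty.
    recent.shift while recent.first <= now - @window
    recent.size > @max_requests
  end
end

tracker = ActivityTracker.new(max_requests: 3, window: 10)
5.times { |i| puts tracker.challenge?("203.0.113.7", i.to_f) }
```

In a Rails app this check might sit in a `before_action` (a `before_filter` in older versions) that redirects to a captcha page when `challenge?` returns true.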

lothar
Wouldn't they just set an appropriate delay in how often they gather data?
nevets1219
@nevets1219 Well, you cannot stop them completely, just slow them down or make their work harder. The OP already acknowledged that.
lothar
A: 

Duplicate of this excellent StackOverflow question from the developer of Woot.com.

Peter J
That is one sad question. But indeed it seems to cover most of the ground.
+1  A: 

You might also want to look into using some Rack middleware to do rate limiting, as covered in this recent article on API rate limiting (the kind of thing you'd want at Twitter or similar).
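To give an idea of what such middleware involves, here is a minimal sketch of a fixed-window, per-IP limiter written against the plain Rack interface (an object that responds to `call(env)` and returns a status, headers, and body). It is not the implementation from the article mentioned above. The `SimpleThrottle` class, its options, and the injectable `clock` are hypothetical, and the counters live in one process's memory, so a multi-process setup would need a shared store such as memcached.

```ruby
# Hypothetical fixed-window rate limiter for any Rack-based stack.
class SimpleThrottle
  def initialize(app, limit: 60, period: 60, clock: -> { Time.now.to_i })
    @app = app
    @limit = limit    # maximum requests per IP per window
    @period = period  # window length in seconds
    @clock = clock    # injectable so the limiter can be exercised without sleeping
    @hits = Hash.new(0)
    @mutex = Mutex.new
  end

  def call(env)
    ip = env["REMOTE_ADDR"]
    now = @clock.call
    window = now / @period
    count = @mutex.synchronize do
      # Forget counters from earlier windows so the hash does not grow forever.
      @hits.delete_if { |(_ip, w), _count| w < window }
      @hits[[ip, window]] += 1
    end

    if count > @limit
      retry_after = @period - (now % @period)
      [429, { "content-type" => "text/plain", "retry-after" => retry_after.to_s },
       ["Rate limit exceeded\n"]]
    else
      @app.call(env)
    end
  end
end

# Exercising the middleware with a stand-in app:
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["hello\n"]] }
throttled = SimpleThrottle.new(app, limit: 2, period: 60)
3.times { puts throttled.call("REMOTE_ADDR" => "203.0.113.7").first }
```

In a Rails application the class would typically be registered with something like `config.middleware.use SimpleThrottle, limit: 60, period: 60`, which leaves the controllers untouched. Behind a proxy, `REMOTE_ADDR` may be the proxy's address rather than the client's, so the key used for counting would need adjusting.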

chrisrbailey
This prompts a follow-up question: I don't know about Rack. Can I use the rate limiting with Ruby on Rails, possibly with Rack slotted in between?
Felix, I'm not sure I fully understand the question, but at least part of it depends on what your stack is and what your Rails version is. I think if you're on Rails 2.2 (maybe 2.1?) then you're compatible with Rack. Then you'd need to be using a Rack-based stack, and there are a variety of options: Passenger, Thin, or whatnot. But really, the point is that Rack is part of your web server stack, and part of the beauty of that rate limiting implementation is that it works essentially without you having to touch your app. It all happens at the Rack middleware layer.
chrisrbailey