First, declare your intentions in robots.txt.
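As a rough illustration, a robots.txt that asks crawlers to stay away from the API could look like the string in the sketch below. The /api/ path is a hypothetical stand-in for wherever your JSON endpoint lives. The snippet uses Python's standard urllib.robotparser only to show how a well-behaved crawler would interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /api/ path is an assumption, not from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""


def is_allowed(path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a compliant crawler may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Keep in mind that robots.txt is purely advisory. It stops nobody, but it puts your intentions on record.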
Then, send a Set-Cookie header with a nonce or some other unique ID on the main page, but not on your API responses. If the cookie never comes back on a request to your API endpoint, return a 401 Unauthorized response, because the client is a bot, a very broken browser, or somebody who rejects your cookies. The Referer header can serve as an additional check, but it's trivial to fake. Keep track of how many API calls each ID has made. You may also want to match IDs to IP addresses. If the count goes above your threshold, spit back a 403 Forbidden response. Make your threshold high enough that legitimate users don't get caught by it.
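The cookie check and the per-ID counter could be sketched roughly as follows. This is a framework-agnostic illustration, not a drop-in implementation: the class name, the method names, the threshold value, and the in-memory dictionaries are all assumptions, and a real deployment would need shared storage and some expiry policy.

```python
import secrets
from collections import defaultdict
from typing import Optional

# Hypothetical threshold; pick one that legitimate users never reach.
MAX_CALLS_PER_ID = 1000


class ApiGuard:
    """Tracks visitor IDs issued on the main page and counts their API calls."""

    def __init__(self, max_calls: int = MAX_CALLS_PER_ID) -> None:
        self.max_calls = max_calls
        self.calls = defaultdict(int)  # visitor ID -> number of API calls so far
        self.ips = {}                  # visitor ID -> IP address it was issued to

    def issue_id(self, ip: str) -> str:
        """Call when serving the main page; send the result via Set-Cookie."""
        visitor_id = secrets.token_urlsafe(16)
        self.ips[visitor_id] = ip
        return visitor_id

    def check(self, visitor_id: Optional[str], ip: str) -> int:
        """Return the HTTP status code to use for one API request."""
        if visitor_id is None or visitor_id not in self.ips:
            return 401  # cookie missing, or an ID we never issued
        if self.ips[visitor_id] != ip:
            return 401  # optional check: ID presented from a different address
        self.calls[visitor_id] += 1
        if self.calls[visitor_id] > self.max_calls:
            return 403  # over the threshold
        return 200
```

The main page handler would call issue_id() and set the cookie, and the API handler would call check() with the cookie value (or None) and the client address. The IP match is the optional part: users on mobile networks change addresses often, so you may prefer to log mismatches instead of rejecting them.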
Keep good logs, and highlight 401 and 403 responses.
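One simple way to make the interesting responses stand out is to log them at a higher level than ordinary traffic. The sketch below uses Python's standard logging module; the logger name, the function name, and the log format are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("api.access")  # hypothetical logger name


def log_api_response(status: int, ip: str, visitor_id: str, path: str) -> int:
    """Log one API response and return the logging level that was used."""
    # 401 and 403 are logged as warnings so they are easy to filter and alert on.
    level = logging.WARNING if status in (401, 403) else logging.INFO
    logger.log(level, "status=%d ip=%s id=%s path=%s", status, ip, visitor_id, path)
    return level
```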
Realistically, if someone is determined enough, they WILL be able to dump this information. Your goal shouldn't be to make this impossible, because you will never succeed. (See all the usual adages about achieving perfect security.) Instead, you want to make it abundantly clear that:
- This behavior violates the terms of service.
- You are actively trying to prevent this.
- You know that the offender exists and roughly who they are.
- Scary lawyers might start getting involved if this continues.
(You do have a lawyer, right?)
To achieve this, be sure the body of your 403 Forbidden response conveys a scary sounding message along the lines of "This request exceeds the maximum allowed usage of the API. Your IP address has been logged. Please refer to the terms of service and obey the directives in robots.txt."
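Since the endpoint sounds like a JSON call, the 403 body could be built along these lines. The message wording follows the suggestion above; the function name, the JSON field names, and the terms_url parameter are hypothetical.

```python
import json


def forbidden_body(ip: str, terms_url: str) -> str:
    """Build a JSON body for a 403 Forbidden response (field names are illustrative)."""
    message = (
        "This request exceeds the maximum allowed usage of the API. "
        "Your IP address has been logged. "
        "Please refer to the terms of service and obey the directives in robots.txt."
    )
    return json.dumps(
        {"status": 403, "error": message, "logged_ip": ip, "terms": terms_url}
    )
```

Echoing the logged address back in the body is optional, but it makes the "we know roughly who you are" point hard to miss.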
IANAL, but I believe the DMCA can be made to apply in this situation if you claim copyright on your database. This essentially means that if you can trace unauthorized usage of your API to an IP address, you can send a nastygram to their ISP. This should always be a last resort, of course.
I don't encourage the use of assigned API keys/tokens because they turn out to be a barrier to adoption and kind of a pain in the neck to manage. As a counter-point to @womp's answer, Google is slowly moving away from their use. Also, I don't think they actually apply in this case, because it sounds like your "API" is more like a JSON call that's used mainly on your own site.