We are planning to put a large number of business research reports and articles from our intranet onto the Internet. However, we don't want others to copy the content and host it on their own sites.

I read about protection by CAPTCHA and was wondering if this is possible. Readers should be able to read 50% of the article for free, after which a CAPTCHA must be entered to read the rest of the article. (This way we make life a little harder for the copycats.)

Any pointers on how to implement this? The content is in HTML, and we have programming experience in Perl and PHP. We can hire others if required.

Additionally, a search engine will crawl only half of the article. Will it penalize the site for not being able to crawl the rest, since it can't solve the CAPTCHA?

Thanks.

A: 

Readers should be able to read 50% of the article for free, after which a CAPTCHA must be entered to read the rest of the article

Have your PHP programmer output 50% of the article. At the bottom, add a CAPTCHA. If the user types in the correct CAPTCHA, output 100% of the article.
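
A minimal sketch of that flow, written in Python only to keep it short and testable (the same logic ports directly to PHP). The function names, the paragraph-list input, and the `captcha_passed` flag are assumptions for illustration; in practice the flag would come from the session once the CAPTCHA check has succeeded.

```python
import math


def visible_paragraphs(paragraphs, captcha_passed, free_fraction=0.5):
    """Return the paragraphs this reader is allowed to see.

    paragraphs     -- list of paragraph strings (hypothetical input format)
    captcha_passed -- True once the reader has solved the CAPTCHA
    free_fraction  -- share of the article shown without a CAPTCHA
    """
    if captcha_passed:
        return list(paragraphs)
    # Round up so a very short article still shows at least one paragraph.
    cutoff = math.ceil(len(paragraphs) * free_fraction)
    return list(paragraphs[:cutoff])


def render_article(paragraphs, captcha_passed):
    """Build the HTML for the page; the CAPTCHA form is only a placeholder."""
    shown = visible_paragraphs(paragraphs, captcha_passed)
    html = "\n".join(f"<p>{p}</p>" for p in shown)
    if not captcha_passed:
        html += "\n<!-- CAPTCHA form goes here -->"
    return html
```

The important design point is that the second half is withheld on the server, not hidden with CSS or JavaScript; otherwise a scraper can read it straight from the page source.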

Any pointers on how to implement this? The content is in HTML, and we have programming experience in Perl and PHP. We can hire others if required.

As a PHP programmer, I use http://www.phpcaptcha.org to implement captcha.

Additionally, a search engine will crawl only half of the article. Will it penalize the site for not being able to crawl the rest, since it can't solve the CAPTCHA?

No, it won't penalize you, but that particular section will not be shown in the search results.

MrValdez
+1  A: 

There's a really good CAPTCHA service provided by reCAPTCHA - http://recaptcha.net/

There is a PHP class that you can use to do all the hard work.
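
For a rough idea of what the server-side check involves, here is a sketch in Python (not the PHP class itself). It assumes Google's current reCAPTCHA verification endpoint and its `secret`/`response` form fields, which differ from the older recaptcha.net-era API; `captcha_solved` and the injectable `post` parameter are hypothetical names used so the logic can be tried without a network call.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the current Google reCAPTCHA verification endpoint.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def _post_form(url, fields):
    """POST form-encoded fields and return the response body as text."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return resp.read().decode("utf-8")


def captcha_solved(secret_key, client_response, post=_post_form):
    """Return True only if the verification service confirms the answer.

    secret_key      -- the site's private key (never sent to the browser)
    client_response -- the token the CAPTCHA widget added to the form
    post            -- transport function; replaceable for testing
    """
    if not client_response:
        return False  # nothing was submitted, so do not call the service
    body = post(VERIFY_URL, {"secret": secret_key, "response": client_response})
    try:
        return json.loads(body).get("success") is True
    except (ValueError, AttributeError):
        return False  # treat malformed replies as a failed check
```

Only after this returns true would the site mark the session as verified and send the rest of the report.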

It's important to bear in mind that search engines can't solve a CAPTCHA, so they will only index the first half of the report. As long as that half contains most of the relevant keywords, it shouldn't cause a massive problem. Don't make the mistake of "detecting" a search engine and showing it different content from what a normal user sees: the major search engines treat this as cloaking, which they consider a form of spam.

An alternative solution would be to use a service like Copyscape (http://www.copyscape.com/) to protect your content.

Sohnee
Re detecting a search engine: this also leaves an opening for scraping bots to pretend to be a search engine and bypass the CAPTCHA, which makes it fairly worthless.
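
To illustrate how little effort that takes, here is a sketch in Python. The `looks_like_search_engine` check is a hypothetical example of the naive User-Agent test a site might use, and the URL is a placeholder; the point is only that the header is whatever the client chooses to send.

```python
import urllib.request

# A Googlebot-style User-Agent string; any client can claim to be this.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def looks_like_search_engine(user_agent):
    """Naive server-side check (hypothetical): trust the User-Agent header."""
    return "googlebot" in user_agent.lower()


def build_spoofed_request(url):
    """What a scraper would send: an ordinary request with a borrowed header."""
    return urllib.request.Request(url, headers={"User-Agent": CRAWLER_UA})


# The request object is only built here, not sent.
request = build_spoofed_request("http://example.com/report")
# urllib stores header names capitalized as "User-agent".
spoof_accepted = looks_like_search_engine(request.get_header("User-agent"))
```

A site that skips the CAPTCHA whenever this check passes would hand the full article to any scraper that sets the same header.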
Cebjyre
+1  A: 

I know this is not what you're asking, but please take into account that CAPTCHAs are universally broken, and will not protect your content. You said the first half is free, does that mean you intend to charge for the other half? CAPTCHA won't help you here at all...

But even if you're just trying to prevent automated scraping, CAPTCHA still won't do the trick. Check out my answer to another captcha question... Or you can go straight to the ppt I presented at OWASP last year.

AviD
I help out with a forum on which spam was a huge pain in the body part; reCAPTCHA instantly eliminated 100% of it. Does the fact that it doesn't prevent humans from copying content mean that it's "broken"? Not exactly.
reinierpost
Well, if they still wanted to spam your forum, they would just have to invest a little more effort (which IS a good thing), but eventually it would have been possible. So it really just comes down to a value question - is the forum/content/resource worth enough to make that extra effort...
AviD
And wrt it being "broken" - there are 3 main issues with CAPTCHA: a. implementation bugs; b. breakable images (e.g. via OCR); and c. the *concept* itself - I always say CAPTCHA is solving the WRONG problem. If the content is being scraped, does it matter if it's a script or a roomful of paid human solvers?
AviD
A: 

As already mentioned, reCAPTCHA is a good way to go.

Have a look at Captcha::reCAPTCHA on CPAN, which according to the CPAN rating reviews "Works out of the box".

If you want a CAPTCHA, there are plenty of modules on CPAN that do this ;-)

Hope that helps.

/I3az/

draegtun