I just noticed that one of my questions, http://stackoverflow.com/questions/449207/how-can-i-call-activatekeyboardlayout-from-64bit-windows-vista, came up in a Google search on another site: http://devmeat.com/show/898409

It got me thinking: why would devmeat repackage SO content? Web traffic, money, maybe even an altruistic desire to bring value to their readers.

So, is there anything a web programmer can do to prevent this type of wholesale "repackaging" of content?

Note: I'm looking for a technological solution, not a legal one

EDIT: Here is the real world problem I'm trying to solve. I have spent the last 5 years creating a Punjabi x English Dictionary. I am interested in making it available through a web interface, but am concerned (maybe needlessly so) that someone will write a bot script to send over 30,000 English words and capture the translations.

Stephen writes below: "The whole informatics revolution is about being able to copy for (nearly) free"

So now I am faced with the personal question of IP vs. a "gift to the world". BTW, I've made no decisions; I'm just wrestling with the question.

David's comment "You can at least cut down on commercial use of what you put out..." strikes a chord with me. I give away a Windows-based version of the program for free, and after reading these comments, I've identified that my concern is that I don't want others to package my work and resell it. So maybe the solution is a legal one after all :)

+5  A: 

Nothing that is worth the effort. At best you could convert all your text to images and hope no one OCRs it.

Web content is downloaded to the client side. It's as simple as that. Anything visible on your site is public.

Your other option is to hire a lawyer to sue for copyright infringement. (If you can find the dastards to sue)

For a technical solution, you just want to make it harder for bots to steal your textual content? You have many choices, none of which are bulletproof.

  • Require users to log in to see it (a sketch of this follows below).

  • Convert everything to a Flash movie that doesn't include selectable text.

  • Convert your text to GIF or PNG images (and increase the data size by at least 10x)

There are others, but most people would advise you not to go that route, unless you can give a more specific situation and set of requirements.
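
As an illustration of the first option (requiring a login), here is a minimal sketch assuming an Express-style Node server and a hypothetical isLoggedIn check wired up to whatever session handling you already have:

    import express from "express";

    const app = express();

    // Hypothetical session check -- wire this to whatever auth/session store you use.
    function isLoggedIn(req: express.Request): boolean {
      return Boolean((req as any).session?.userId);
    }

    // Gate every /content route behind the login check.
    app.use("/content", (req, res, next) => {
      if (!isLoggedIn(req)) {
        res.status(401).send("Please log in to view this content.");
        return;
      }
      next();
    });

    app.get("/content/:word", (req, res) => {
      res.send(`Entry for ${req.params.word}`); // stub: your real lookup goes here
    });

    app.listen(3000);

This only raises the bar, of course; a logged-in user can still be a bot.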

tkotitan
Converting to images would be worse than letting the IP go, IMHO...
annakata
Increasing the data size by 10x is a very lousy idea. Adversaries will take 20 hours instead of 2... so what? It will use up more of your bandwidth for genuine visitors and worsen their UX. Login doesn't help either; a logged-in user can also be an adversary.
aleemb
@aleemb: increasing size is an unavoidable side effect of converting text to images, not a feature, IMHO.
Piskvor
+3  A: 

Technically, there's no way to prevent copying. At most it can be made a bit difficult, but it's certainly not worth it. Legally, you could just prohibit it. Then again, prohibiting doesn't prevent anything...

Joonas Pulakka
A: 

You can't really do anything about it. It's either publicly available or it's not.

EDIT: I'm not talking about whether it is legal or not; the poster is asking a technical question. I am suggesting that once you post it on the Internet, it effectively enters the "public domain" in the sense that anyone can do whatever they want with it. Whether this is legal or not is irrelevant if the person doing it is quite happy to engage in illegal activities.

DrG
"public domain" is a legal term with a specific meaning, that the owner specifically waives all rights to the content and anyone can legally do with it as they like. Everything you write on SO is copyright by you. You mean publicly available.
KeithB
Nice remark. As far as I know, even if you don't state anything on your page, your content will not automatically enter the public domain.
User
As presumably technical people, we should use the correct terminology as much as possible. There is enough confusion around copyright and patents as it is.
KeithB
Thank goodness, I just scrolled down having written more or less the same comment about the misuse of the term "public domain" on another answer.
Rob
+1  A: 

Well, the content on this site is released under a CC license, so you can't prevent certain 'repackaging', as you call it. Generally you can't do much about it, except emailing them to ask if they will remove it.

Other than that, Google and other search engines are constantly improving their duplicate-content filters. Just don't bother too much with it; it's not worth your time.

Tomh
A: 

Not unless you can pay to hire a team of lawyers.

EBGreen
A team of lawyers can do nothing about guys sitting in some other country, especially if their local legislation does not punish copying.
User
A: 

You would base your site on the principle that anything on it is copyrighted to yourself. Therefore, if someone steals from you, you can take legal action against them.

Of course, this can never work on a site such as Stack Overflow, as the entire concept is based around sharing your intellect with the community (the world).

Robin Day
A: 

If you obfuscate your content or deliver a bad experience to attempt to protect it, you run the risk of fading into obscurity, as devmeat.com will.

BC
+1  A: 

I imagine devmeat is doing some sort of feed aggregation.

SO publishes Atom feeds for each question - see http://stackoverflow.com/feeds/question/449207

Plus a recent questions feed - http://stackoverflow.com/feeds

So in effect SO is saying - "Here - please have this content."

With your own content on your own site, once it's on the Internet there is not really anything you can do technically to stop people; as others have said, it's just a legal issue.

DanSingerman
A: 

Just because something is publicly available does not mean that there is no copyright (in the UK). The author still owns the copyright to the work, but as mentioned above there is nothing you can do to PREVENT it from happening from a technical perspective.

That's the nature of HTML and the whole way the World Wide Web works; blame TBL (or rather, blame the nefarious individuals who cannot think of their own original content).

EDIT: Removed reference to the term 'Public Domain' as I did not mean it in its legally defined usage.

Charlie
Oh yes, I agree. I'm just saying that if it is in the "public domain" you can't actually stop people from taking it.
DrG
Ugh, confusion of terminologies. "Public domain" is a term which, when applied to intellectual property, does in fact mean that there is no copyright, either because it has expired, or because it was never eligible for protection in the first place. Don't confuse "in the public domain" with "published on the World Wide Web".
Rob
A: 

If the process is an automatic one then you can take some steps, such as including only brief snippets of your content in your feeds. You can also output a chunk of your body text using JavaScript, which means that most automated solutions will miss it. Unfortunately, that also means search engines won't index that content either. You can't have it both ways! :)
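
A minimal sketch of that JavaScript idea, assuming the page ships with an empty placeholder element and a hypothetical /snippet endpoint that serves the real text:

    // Runs in the browser: the HTML ships with an empty placeholder,
    // and the body text is fetched and inserted after the page loads.
    async function loadBody(): Promise<void> {
      const target = document.getElementById("article-body");
      if (!target) return;
      const id = target.dataset.id ?? "";
      const response = await fetch("/snippet?id=" + encodeURIComponent(id));
      target.textContent = await response.text();
    }

    document.addEventListener("DOMContentLoaded", () => {
      void loadBody();
    });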

Ralpharama
A: 

Technically: You can't do anything about it.

(You can make it harder, but this usually makes it harder for your users to use the site too, so don't.)

christian studer
+1  A: 

The whole informatics revolution is about being able to copy for (nearly) free. Artificially restricting that (by law) only reduces your ability to compete. Business models based on copying and distribution being expensive (publishing, music) will be replaced.

[edit] The same technology that gives you an audience of millions also allows everyone to copy your content. Making money out of it can be done by providing added value:

  • letting people know you're the expert who created it and can do consultancy/ paid extensions;
  • being faster/more up to date. You might be able to improve the content faster than the bots pick it up;
  • simply asking for money from the part of the market that values support. A market consists only of the people who are willing to pay for it.
Stephan Eggermont
Nice point. This helped me clarify what I was looking for and why; see my edits above.
Noah
A: 

If you have to remain text-based, about all you can do is monitor and filter. If you know, from your logs, that you're getting a disproportionate amount of traffic from a particular source, you can deny requests from that source, using one or more properties of the request. It's a crazily moving target, and totally unguaranteed, but it's an option.

Also, if you're not publishing feeds intended for others to read legitimately, you can vary the structure of your documents (assuming you're generating them dynamically) in slight ways that'd disrupt screen-scraping efforts. Again, totally not guaranteed, and likely to have adverse effects, but something.
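
A rough sketch of what varying the structure might look like, assuming server-side rendering in a Node process; the random class names and span boundaries here are arbitrary, illustrative choices:

    import { randomBytes } from "crypto";

    // Server-side: wrap the text in markup whose class name and span boundaries
    // change on every request, so a scraper's fixed selectors and regexes break.
    function renderObfuscated(text: string): string {
      const cls = "c" + randomBytes(4).toString("hex"); // fresh class name each render
      const parts: string[] = [];
      let i = 0;
      while (i < text.length) {
        const step = 20 + Math.floor(Math.random() * 30); // random chunk length
        parts.push(`<span>${escapeHtml(text.slice(i, i + step))}</span>`);
        i += step;
      }
      return `<div class="${cls}">${parts.join("")}</div>`;
    }

    function escapeHtml(s: string): string {
      return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
    }

The adverse effects mentioned above still apply: your own CSS, accessibility, and maintenance all get harder too.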

If securing your content is important enough to you, though (as it was for one of my clients), Flash is definitely an option. Provided you can get the content to the SWF securely, and you code your Flash app to support deep linking, your visitors will be able to read it, and (with the Flash search player currently under active development) your content will be search-engine findable as well.

Even so -- no guarantees. Copy-paste, OCR, etc. -- there'll always be workarounds. The only question is how far the hacker is willing to go to achieve them. All you can really do is deter.

Christian Nunciato
Why the downvote? Something wrong with my post?
Christian Nunciato
A: 

In the short term, you can convert your site to AJAX: none of the content is contained in the page originally downloaded; instead, JavaScript is executed which fetches the necessary content and displays it.

This would require the bot authors to specifically attack and customize their bot for your site, either by analyzing the JavaScript (which is less effective, since you can merely change the script a bit, and the URLs it pulls from, to break their bot again) or by implementing a JavaScript engine and then ripping the 'rendered' page (maybe using Greasemonkey).

Either option is painful, and unless you have very desirable content it's not worth it, so it's as effective as it needs to be.

If your content is very valuable, though, then the only thing you can do is make it hard for casual hackers to get at it (such as the above) and then employ bots to search for infringements and automatically send DMCA takedown notices. This is relatively hands-free work, so it's not as onerous as you might think, and it is reasonably effective.
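
One way to keep that moving target moving, sketched here with Express and a made-up daily-rotating token (all names and routes are hypothetical illustrations, not an existing API):

    import express from "express";
    import { createHmac } from "crypto";

    const app = express();
    const SECRET = process.env.CONTENT_SECRET ?? "change-me";

    // The content URL contains a token that changes daily. The page embeds the
    // current token on each render, so browsers follow along while a bot that
    // hard-coded yesterday's URL gets a 404.
    function todaysToken(): string {
      const day = new Date().toISOString().slice(0, 10);
      return createHmac("sha256", SECRET).update(day).digest("hex").slice(0, 12);
    }

    app.get("/c/:token/:id", (req, res) => {
      if (req.params.token !== todaysToken()) {
        res.sendStatus(404);
        return;
      }
      res.send(`content for ${req.params.id}`); // stub: fetch from your store
    });

    app.get("/", (_req, res) => {
      // Embed the token so the client-side script knows where to fetch from.
      res.send(`<script>const CONTENT_TOKEN = ${JSON.stringify(todaysToken())};</script>`);
    });

    app.listen(3000);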

Adam Davis
+3  A: 

You could limit the amount of content that one user (one IP address, presumably) is allowed to receive through your web interface, similar to how Google Books restricts the number of pages you can view for some books.

Not that it's hacker-proof, but it could be one approach.
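
A minimal sketch of such a quota, assuming an Express server and an in-memory counter; a real deployment would have to persist the counts and deal with shared IPs and proxies:

    import express from "express";

    const app = express();
    const DAILY_LIMIT = 100; // lookups allowed per IP per day -- pick your own number
    const counts = new Map<string, { day: string; n: number }>();

    app.get("/lookup/:word", (req, res) => {
      const ip = req.ip ?? "unknown";
      const today = new Date().toISOString().slice(0, 10);
      const entry = counts.get(ip);
      const used = entry && entry.day === today ? entry.n : 0;

      if (used >= DAILY_LIMIT) {
        res.status(429).send("Daily lookup limit reached.");
        return;
      }
      counts.set(ip, { day: today, n: used + 1 });
      res.send(`translation of ${req.params.word}`); // stub: your dictionary lookup
    });

    app.listen(3000);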

mquander
+3  A: 

The general rule is that, if I can see it on a computer, I can copy it. People have been trying to change this for years, not very successfully. The more successful attempts involve carefully written software that must be run in order to view the content (this is usually done through cryptography) and that resists use for any other purpose.

As it happens, people don't generally surf the web with special software; they surf it with programs like Firefox and Internet Explorer and Safari and Opera. There's no way you can serve information up to a web browser and still keep any control over it.

Your only recourse is legal. Put a copyright notice on the page, and decide who you're prepared to take legal action against. You can at least cut down on commercial use of what you put out in that way, although it won't be worthwhile to go after noncommercial use. There is no technological solution.

One thing to consider if you do decide to put your dictionary on-line: provide a full computer-readable download with whatever copyright license you like (there's a large variety of Creative Commons licenses, and one may be suitable for you). That way, fewer people will hit the site tens of thousands of times to copy your dictionary word by word.

David Thornley
A: 

Devmeat is NOT ripping any content as you suggest (or have I misunderstood?). Devmeat.com is an RSS/feed aggregator specialized for software developers. For me, it's basically a place I can go and check what's new between work tasks, with filtered/categorized news. It uses RSS, which is publicly available on Stack Overflow (and many other sites), not some kind of site crawling or anything like that. It's completely legal. If the people at Stack Overflow did not wish to publish content via RSS, they would not do so. But they did, as do many other sites, because it's good for them: traffic.

And to answer your question: you are asking it in the wrong context. Given the context you describe, I should just say 'don't publish your data as RSS', because RSS was MADE for this; it's what devmeat is doing: RSS aggregation.

Otherwise, if you want to prevent people from ripping your data (with bots/crawling), I think there's nothing you can do to be completely safe, simply because you publish the data in a public place. I think ripping content IS bad, but you gave the wrong example.

jozefsevcik
A: 

I would suggest that if you don't want the content to be repackaged, then you will have to write your own client and transmit data to it in an encrypted fashion. E.g., I think a Java applet could do what you want, with a bit of rework on the text rendering to disallow copying and pasting.

If you want to focus on providing a really great client and let the content slide, you run a big risk of some other 1-man shop developer doing a better job.

Personally, I'd suggest locking the whole set of information and client down and getting a solid license.

Paul Nathan
A: 

I would disagree on principle, but I believe that Flash is the solution.

Overflown
A: 

You should disable the right mouse click, as some sites do. Then you should also put up strongly worded legal text: "Violating copyright laws is a serious thing."
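
For what it's worth, disabling right-click amounts to a one-line event handler in the page, which is also a measure of how little it accomplishes:

    // Browser-side: suppress the context menu. Trivial to bypass, and bots never see it.
    document.addEventListener("contextmenu", (event) => {
      event.preventDefault();
    });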

fastcodejava
Bots use right click, eh?
Kristen