A client of mine who is a social sciences researcher at a university is asking if I can write a spider to do statistical data mining from a subscription-only academic database. He would like to use the statistics for his academic research.

(For those interested, this would involve downloading thousands of text documents and then doing linguistic analyses to look for the frequency of certain words and phrases to test how language is used. The documents themselves would not be republished or reproduced in any way.)

I am trying to determine whether this type of work is generally considered permissible (e.g. fair use). The website's terms of service do not appear to specifically prohibit screen scraping. When I get a chance I will ask a friend who is a lawyer, but in the meantime, does anyone have pointers to information on when this kind of data mining work is considered fair use?

(This question was relevant and answered part of my question; I am looking for more information specifically on data mining without republication.)

+9  A: 

I would explain your case to the site owner and ask.

Ólafur Waage
Start with the simple solution and go from there. I fail to see why an academic oriented website would have a problem with someone using the site for academic research when there is no reproduction of the original material.
rism
+1 and agreed, rism. It's weird how people jump straight to the questionable solution rather than the simple one of just asking.
cletus
+1  A: 

If he pays for an account and you are simply automating something he could do on his own (i.e. download all the text, go through it line by line and pull out the relevant data), I think you are OK. This is of course assuming they do not specifically prohibit it in their terms of service and nothing you do could be interpreted as a slam or DoS type of attack.
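To make the "not a DoS" part concrete, here is a minimal sketch of spacing requests out so the spider behaves more like a patient human reader. The names `Throttle` and `fetch_all` are made up for illustration, `fetch` stands in for whatever download function you actually use, and the interval you pick is an assumption you would want to agree with the site owner.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL in order, pausing between requests."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With something like `fetch_all(urls, my_download, Throttle(5.0))` the site would see at most one request every five seconds, which is the kind of load that is hard to mistake for an attack.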

brendan
A: 

It is not permitted, but you can ask them nicely (you may even avoid the scraping if they are kind enough to provide the documents).

Eduardo Molteni
+1  A: 

If your friend is eventually going to publish his results, he'll have to cite his sources. Will that not be a problem then? It's better to ask the site's authorities for permission to use it.

dirkgently
A: 

If you are only "reading" the documents and constructing your own data based on their contents, without making a copy of the contents, it should be perfectly acceptable. If you kept a copy of the original on your own site for longer than the time needed to construct your statistics, that might be a violation, depending on the terms contained in the subscription.
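As a rough sketch of what "reading" without keeping a copy could look like: each document is reduced to counts as soon as it arrives, and only the counts are kept. The function names and the crude letters-and-apostrophes tokenizer are assumptions for illustration, not a recommendation for real linguistic work.

```python
import re
from collections import Counter

# Deliberately simple tokenizer (an assumption): runs of letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def phrase_counts(text, phrases):
    """Count how often each word or multi-word phrase occurs in one document."""
    tokens = TOKEN.findall(text.lower())
    counts = Counter()
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        # Slide a window of n tokens across the document and compare.
        counts[phrase] = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
        )
    return counts


def tally(documents, phrases):
    """Accumulate counts over an iterable of document texts.

    Only the running totals are retained; each text is dropped
    as soon as it has been counted.
    """
    total = Counter()
    for text in documents:
        total.update(phrase_counts(text, phrases))
    return total
```

If `documents` is a generator that yields each page as it is downloaded, nothing but the totals ever needs to be written to disk.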

On the other hand, there may be a better way to get access to the data your friend needs that wouldn't cause as much traffic for the site. You may want to investigate whether they would make the corpus available on DVD for use in this research. They may appreciate not having their site hit by your spider.

tvanfosson
A: 

I'm not a lawyer, but my two cents:

If the data derived from the analysis is purely for research and not for profit, then it may be OK. I wouldn't do it without express permission from the owner of the site. Legally you want permission, because if it goes to litigation it's too late for forgiveness :)

Dave Swersky
+1  A: 

From my understanding of copyright law, I'd say you're on pretty solid ground. The most important thing is that you're not actually reproducing the content. It would be pretty hard to argue that statistics about word frequency are part of the author's creative work. My guess is that this falls in the same area as facts contained within a copyrighted work (which are not protected).

Even if you were doing something slightly more substantial with the works, you would have a pretty good claim to fair use of the work. Here are the factors frequently taken into consideration in determining fair use, according to the US Government:

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

  2. the nature of the copyrighted work;

  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

  4. the effect of the use upon the potential market for or value of the copyrighted work.

Given that you're using the works for an academic, nonprofit purpose that will have little to no effect on the value of the copyrighted work, you'd be sitting pretty, as long as you didn't reproduce large swaths of the texts.

All that said, the TOS are by far the most likely place for something to prohibit it, so read them again, extra carefully.

And, of course, be aware that I'm not a lawyer and have no formal legal training.

IanGreenleaf