I have a serious question. I'm not trying to start a flamewar or to incite any violence--but here goes.

Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind:

1.) If someone puts a web site up they're expecting some visits. Granted, web crawlers are using bandwidth without clicking on ads that may support the site but the site owner is putting their site on the web, right, so how reasonable is it for them to expect that they'll never get visited by a bot?

2.) Some sites apparently use a robots.txt exactly in order to keep their site from being crawled by Google or some other utility that might grab prices and therefore allow people to do price comparisons easily. They have private search engines on the site so they obviously want people to be able to search the site; apparently they just don't want people to be able to easily compare their information with other vendors.

As I said, I'm not trying to be argumentative and I'm not trying to start a big argument; I would just like to know if anyone has ever come up with a case where it's ethically permissible to ignore the presence of a robots.txt file? I cannot think of a case where it's permissible to ignore the robots.txt mainly because people (or businesses) are paying money to put up their web sites so they should be able to tell the Googles/Yahoos/Other SE's of the world that they don't want to be on their indices.

To put this discussion in context, I'd like to create a price comparison website and one of the major vendors has a robots.txt that basically prevents anyone from grabbing their prices. I'd like to be able to get their information but, as I said, I can't justify simply ignoring the wishes of the site owner.

I am making this a community wiki question exactly because I believe it might generate some spirited debate. Or maybe not.

I suspect this discussion may belong elsewhere and if it does, just let me know. I have seen some very sharp discussion here and that's why I would like to hear the opinions of developers that follow Stack Overflow.

By the way, there is some discussion of this topic on a Hacker News question but they seem to mainly focus on the legal aspects of this.

+3  A: 

"No" means "no".

John Saunders
And everything written inside Terms of Service is perfectly true?
ilya n.
Maybe yes, maybe no, but assume it's true, and ask. Otherwise, you're assuming you're being lied to. I, personally, would be offended that you assumed I was lying. Others, instead of being offended, would call their lawyers. My suggestion: be nice.
John Saunders
+23  A: 

The other use of robots.txt is to help protect web spiders from themselves. It's relatively easy for a web spider to get mired in an infinitely deep forest of links, and a properly constructed robots.txt file will tell the spider that "you don't need to go here".
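As a rough illustration of how a spider might consult these hints, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleBot`, and the example.com URLs are all made up for the example; a real crawler would fetch the file from the site rather than parse a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /calendar/
"""

# Hypothetical crawler name; "User-agent: *" applies to any automated client.
BOT_NAME = "ExampleBot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# An ordinary page is not covered by any Disallow rule.
allowed_article = parser.can_fetch(BOT_NAME, "http://example.com/articles/1")

# A label search falls under "Disallow: /search", so the spider skips it.
allowed_search = parser.can_fetch(BOT_NAME, "http://example.com/search?label=foo")
```

The spider would simply skip any URL for which `can_fetch` returns `False`, which keeps it out of the "infinitely deep" parts of the site.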

Greg Hewgill
This is a good point. Blogger, for instance, tells crawlers to ignore label searches because those should've all been found already.
cletus
So does stackoverflow: http://stackoverflow.com/robots.txt
Greg Hewgill
Thanks Greg. I've got no plans to ignore a robots.txt--I just wanted to know if there might be other things I hadn't considered.
Onorio Catenacci
A: 

If people make it available to public access, they shouldn't try to put limits on it. Adding a robots.txt file to your site is the equivalent of putting a sign on your lawn that says "Please don't look at me."

Andrei Krotkov
Your analogy is imperfect. A lawn has a specific extent in space. It is possible to know when you've seen all of a lawn. Not so with a web site. The fact that your analogy is so far off, simply in terms of "physical extent" suggests to me that you may want to revisit your entire approach to this question.
John Saunders
In my opinion using robots.txt to attempt to hide something is like putting a sign outside your house that says "Do not use combination 22-18-76 to open the safe in the closet of the master bedroom"
Unkwntech
This is a bogus comparison. Looking at someone's lawn doesn't use their resources. Browsing or crawling their website does, so it's completely reasonable to impose limitations.
Matthew Flaschen
Unkwntech, everyone knows robots.txt is not for security (e.g. it's not a substitute for password-protection). Again it's simply about respect for the site owner's resources.
Matthew Flaschen
No. Adding a robots.txt file to your site is the equivalent of putting a sign on your lawn that says, "Only walk in the allowed locations, which are clearly marked."
Eddie
Please keep off the grass
johnc
I'd say it's much more like those ropes ushers use to reserve seats. You *can* bypass them easily, but you're not supposed to out of ethical/social pressure
Michael Haren
+33  A: 

Arguments:

  1. A robots.txt file is an implied license, especially since you are aware of it. Thus, continuing to scrape their site could be seen as unauthorized access (i.e., hacking). Sucks, but arguments like this have been made in other legal cases recently (not directly related to robots.txt, but in relation to other "passive controls").
  2. Grabbing prices violates no copyright law, including DMCA, since copyright does not include factual information, only creative.
  3. Ethically, you should not grab prices because the vendor should have the ability to change prices without worrying about being accused of a bait-and-switch by people coming from your site.
  4. Have you taken the high road, explaining the site to them and saying you'd love to include them in your list of vendors? Maybe they will love the idea and actually expose the data in a way that is easy for you to consume and less resource-intensive for them to produce.
  5. There are no laws written directly about robots.txt because netiquette is generally followed. Don't be one of the "bad guys."
  6. Some people filter robots because they use URL links to perform "actions" like adding things to carts, and robots leave them with massive numbers of abandoned shopping carts in their database.
  7. Some people filter robots because they have exclusive prices that they can't advertise openly based on agreements with their vendors. You could be putting them in a bad position by exposing those prices on your site.
  8. In this economy, if a company doesn't want to do everything possible to advertise themselves, it's their own fault that you don't include them.
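To make point 6 concrete, here is a small sketch of how a well-behaved scraper could drop "action" links before it ever requests them. It uses Python's standard `urllib.robotparser`; the robots.txt rules, the `shop.example` URLs, and the bot name are hypothetical and chosen only to illustrate the idea.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a shop might publish to keep bots away from action URLs.
SHOP_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout
"""

# Hypothetical links a crawler found on a product listing page.
CANDIDATE_URLS = [
    "http://shop.example/products/42",
    "http://shop.example/cart/add?item=42",
    "http://shop.example/checkout",
]


def filter_allowed(parser, bot_name, urls):
    """Return only the URLs that robots.txt permits this bot to fetch."""
    return [url for url in urls if parser.can_fetch(bot_name, url)]


shop_parser = RobotFileParser()
shop_parser.parse(SHOP_ROBOTS_TXT.splitlines())

# Only the product page survives; the cart and checkout links are skipped.
safe_urls = filter_allowed(shop_parser, "ExampleBot", CANDIDATE_URLS)
```

A crawler that filters its queue this way never triggers the add-to-cart link, so it leaves no abandoned carts behind.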
richardtallent
Thanks. This is exactly the sort of discussion I was hoping for.
Onorio Catenacci
Great comment. Very thorough.
MattK311
I agree with all points except 5.
SLC
I would especially consider points 4 and 8. What kind of company would not want to spread the word about what they offer?
Marcel
A: 

An interesting IRL version of this story, involving The Harvard Coop: Coop Calls Cops On ISBN Copiers.

ilya n.
+1  A: 

To answer the narrow question, for the price comparison website you're probably best off grabbing the price in real time, rather than scraping the database in advance. Hard to imagine that being a problem.

ilya n.
+1  A: 

One reason we allow robots to dig through the web without complaint is that we have a way to stop them if we want to. Protects both sides.

Remember the uproar when Cuil's robots were accused of going over-the-top, apparently acting like a DoS attack in some cases and using up bandwidth allowances of some small sites?

If too many people violate robots.txt we might get something worse.
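One way a bot can avoid looking like a DoS attack is to honor a `Crawl-delay` line when the site publishes one. `Crawl-delay` is a widely used but non-standard robots.txt extension, and the sketch below reads it with `RobotFileParser.crawl_delay` from Python's standard library (available since Python 3.6). The robots.txt content, the bot name, and the one-second fallback are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one specific bot to wait between requests.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 5
Disallow: /private/
"""


def polite_delay(parser, bot_name, default_seconds=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if it
    states one for this bot, otherwise a conservative default (assumed)."""
    delay = parser.crawl_delay(bot_name)
    return float(delay) if delay is not None else default_seconds


parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())
requested_delay = polite_delay(parser, "ExampleBot")

# A site with an empty robots.txt states no delay, so the default applies.
empty_parser = RobotFileParser()
empty_parser.parse([])
fallback_delay = polite_delay(empty_parser, "ExampleBot")
```

A crawler loop would then call `time.sleep(requested_delay)` between requests to the same host, which is usually enough to keep a small site's bandwidth allowance intact.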

Nosredna
Indeed, this is where ignoring robots.txt will lead us: http://www.theonion.com/content/video/in_the_know_are_we_giving_the
ilya n.
A: 

I'm showing some ignorance here, but I always thought a bot was something only sent out by a search engine, like Google or Yahoo.

Thus, if you wrote an application that searched content on the internet, I wouldn't consider that a search engine bot, which to my knowledge is what robots.txt is trying to block.

But this may just be selective ignorance, because I might do it until the webmaster of that site contacted me and asked me to stop :)

MattK311
It's called robots.txt, not search-engines.txt. It's for all automated Web crawlers to obey -- anything not interactively operated by a human. Besides, it's a funny state of mind that thinks something that searches content on the Internet isn't a search engine.
Rob Kennedy
Like I said, "selective ignorance". But yeah, I agree with what you're saying.
MattK311
A bot would be any automated scraper that goes against a web site and retrieves information. IMHO, it does not matter if the software is written by an individual or a company.
Raj More
+1  A: 

Short answer: No.

On the narrow issue: If a seller says that their prices are secret, I think you have to respect that. I'd contact them and ask if they really don't want price comparison engines like yours to include them, or if the "no trespassing" sign is for technical reasons. If the latter, perhaps they'll provide you with an alternative. If the former, then I'd say too bad, they don't get included, they lose some business, and it's their problem.

Tangential rant: Personally, I get pretty annoyed with companies that make me jump through hoops to find out the price of their products, places that make me call and talk to a salesman so he can give me a hard-sell pitch, or worse, make me give them my phone number so their salesman can call and harass me. I figure that if they're afraid to tell me the price, it probably means that it's too high.

In general: A robots.txt file is like a "No Trespassing" sign. It's the owner's right to say who is allowed on their property. If you think their reasons are dumb, you can politely suggest they take the sign down. But you don't have the right to disregard their wishes. If someone puts a No Trespassing sign on his yard, and I say, "Hey, I just want to take a quick short cut, what's the big deal?" -- Maybe I'm stepping on his prized Bulgarian violet bulbs and destroying a valuable investment. Maybe I'm crossing his people's sacred burial ground and offending their religious sensibilities. Or maybe he's just an ornery jerk. But it's still his property and his right. Oh, and if I fall into the dangerous sinkhole after ignoring the No Trespassing sign, who's to blame? (In America, I could probably still sue him for all he's worth despite the fact that he warned me, but is that right?)

Jay
+2  A: 

Many people have tried to build businesses on "price comparison" engines that scrape major sites.

Once you start getting any sort of traffic/revenue to speak of, you will receive a cease and desist. It's happened to dozens, if not hundreds of projects. I even worked on a small project that received a C&D from Craigslist.

You know how they say "It's easier to ask forgiveness than it is to get permission"? It doesn't hold true with page scraping. Get permission, or you will be hearing from their lawyers.

If you're lucky, it'll be early on, when you've got nothing to lose. If it's late, you may lose your business and all your work overnight, with a single letter.

Getting permission shouldn't be hard. Unless you're doing something sneaky, you're likely going to drive them additional traffic. Hell, once your product takes off, sites may be begging you, or even paying you to add their data.

Frank Farmer