views: 293
answers: 6

Problem: to find answers and exercises for Mathematics lectures at the University of Helsinki.

Practical problems

  1. make a list of .com sites whose robots.txt contains a Disallow directive (a sketch follows this list)
  2. make a list of the sites from (1) that contain *.pdf files
  3. make a list of the sites from (2) whose PDF files contain the word "analyysi"
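
A minimal sketch for problem 1, assuming you already have a candidate list of .com domains in a hypothetical file domains.txt (there is no free way to enumerate every registered .com domain, as the answers below point out):

    # Sketch: check which candidate .com domains have a Disallow rule in robots.txt.
    # The input file "domains.txt" (one domain per line) is a hypothetical placeholder.
    import urllib.request

    def has_disallow(domain: str) -> bool:
        """Return True if the domain serves a robots.txt containing a Disallow line."""
        url = f"http://{domain}/robots.txt"
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                text = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            return False  # unreachable host, HTTP error, or malformed domain
        return any(line.strip().lower().startswith("disallow:")
                   for line in text.splitlines())

    if __name__ == "__main__":
        with open("domains.txt") as f:
            domains = [line.strip() for line in f if line.strip()]
        for domain in domains:
            if has_disallow(domain):
                print(domain)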

Suggestions for practical problems

  1. Problem 3: write a scraper that extracts the text from PDF files (see the sketch below)
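
A minimal sketch of such a scraper, assuming the candidate PDF files have already been downloaded into a hypothetical local directory pdfs/ and using the third-party pdfminer.six package (one of several options) for text extraction:

    # Sketch: check locally downloaded PDF files for the word "analyysi".
    # Assumes the third-party pdfminer.six package is installed and that the
    # PDFs sit in a hypothetical local directory "pdfs/".
    from pathlib import Path
    from pdfminer.high_level import extract_text

    def contains_word(pdf_path: Path, word: str = "analyysi") -> bool:
        """Return True if the extracted text of the PDF contains the word."""
        try:
            text = extract_text(str(pdf_path))
        except Exception:
            return False  # corrupt, encrypted, or otherwise unreadable PDF
        return word.lower() in text.lower()

    if __name__ == "__main__":
        for pdf in sorted(Path("pdfs").glob("*.pdf")):
            if contains_word(pdf):
                print(pdf)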

Questions

  1. How can you search for .com sites that are registered?
  2. How would you solve practical problems 1 and 2 with Python's defaultdict and BeautifulSoup? (A sketch of what I mean follows.)
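
To make question 2 concrete, here is a minimal sketch of the kind of defaultdict + BeautifulSoup combination it refers to; the sites list is a hypothetical placeholder for the output of problem 1, and only each site's front page is scanned for links ending in .pdf:

    # Sketch: collect PDF links per site with BeautifulSoup, grouped in a defaultdict.
    # Assumes the third-party packages requests and beautifulsoup4 are installed;
    # the "sites" list is a hypothetical placeholder for the output of problem 1.
    from collections import defaultdict
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    sites = ["http://example.com"]  # placeholder: candidate sites from problem 1
    pdf_links = defaultdict(list)   # site -> PDF URLs found on its front page

    for site in sites:
        try:
            html = requests.get(site, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable sites
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            if anchor["href"].lower().endswith(".pdf"):
                pdf_links[site].append(urljoin(site, anchor["href"]))

    for site, links in pdf_links.items():
        print(site, "->", len(links), "PDF link(s)")
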
+3  A: 

If I understand your requirements, you'd essentially have to spider every possible site in order to see which one(s) match your criteria. I don't see any faster or more efficient solution, regardless of what tools you use.

Head Geek
@Head Geek: Have you any idea how long that would take? My calculations give me a result of one month.
Masi
Depends on a lot of things, e.g. what kind of equipment you will be using. Obviously if you're Google and you have many hundreds of thousands of dedicated servers, it can be fairly quick. If you're working on just a few standard PCs... my intuition suggests years/decades/centuries, though I don't know the actual numbers involved.
David Zaslavsky
@David: Does Google have a database of websites that are blocked by robots.txt? In other words, is it possible to use Google to search a site with a strict robots.txt? I have apparently been able to search a site blocked by robots.txt with the following Google query: site:mySite.com filetype:pdf Analysis
Masi
Use "disallow filetype:txt" in your favorites search engines such as google, bing and you will get a few domains...
merkuro
+1  A: 

If I understand you correctly, then I don't see how this is possible without, as already mentioned, scanning the entire internet. You are looking for pages on the internet which are not on Google? There is no database of every site on the net that records whether or not it is indexed by a search engine...

You would literally need to index the entire web and then go through each site and check whether it is on Google.

I am also unsure whether this relates to one site or to the whole web, since your question seems to switch between the two.

Damien
@Damien: I need to find one site which blocks default web spiders. It may be possible to first make a list of sites which block default web spiders, and then search those sites. I am also unsure of the best way to find the wanted site. Feel free to edit my question if you find parts of it confusing.
Masi
There isn't a list of sites which are not in Google; you would need to find them yourself.
Damien
Well I guess Google would have a list of such sites (not their contents though, as I assume someone would have blown the whistle when seeing Google in the access logs). But indeed: Google does not provide any means to get that list, unless you get a job there.
Arjan
A: 

Do you mean that you have your lectures on a web page of your University's intranet and that you would like to be able to access this page from outside your University's intranet?

I assume that in order to access your Uni's intranet you must enter a password, and that Google does not index any of the Uni's intranet pages -- which is the nature of an intranet.

If all the above assumptions are correct, then you simply need to host your PDF files on a website outside your University's intranet. The simplest way is to start a blog (no cost involved, and very easy and quick to do) and then post your PDF files there.

Google will then index your pages and also "scrape data" from your PDFs, as you put it, which means that the text within your PDF files will be searchable.

mattRo55
@Matt: I do not mean that the lecture exercises are on the intranet. They are openly on the Internet, but blocked by robots.txt, so that Google's web spiders will not index the content of the site.
Masi
So you're looking for your own material (anywhere on the net)?
Arjan
+6  A: 

I am trying to find every web site on the internet that has a pdf-file which has the word "Analyysi"

Not an answer to your question, but: PLEASE respect the site owner's wish to NOT be indexed.

Arjan
Are the two connected: rare references in Mathematics books and their sites being blocked by robots.txt? Hopefully Steven Levitt is on SO :)
Masi
+3  A: 

Your questions are faulty.

With respect to (2), you are making the faulty assumption that you can find all PDF files on a web server. This is not possible, for multiple reasons. The first reason is that not all documents may be referenced. The second is that even if they are referenced, the reference itself may be invisible to you. Finally, there are PDF resources which are generated on the fly: they do not exist until you ask for them, and since they depend on your input, there is an infinite number of them.

Question 3 is faulty for pretty much the same reasons. In particular, the generated PDF may contain the word "analyysi" only if you used it in the query. E.g. http://example.com/makePDF.cgi?analyysi
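
A minimal sketch of such an on-the-fly generator, assuming the third-party reportlab package and a hypothetical /makePDF endpoint echoing the example URL; the word appears in the resulting PDF only because the request put it there:

    # Sketch: a web endpoint that generates a PDF on the fly, so the document
    # (and any word in it) only exists once someone requests it.
    # Assumes the third-party reportlab package; the endpoint itself is hypothetical.
    import io
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse

    from reportlab.pdfgen.canvas import Canvas

    class MakePDF(BaseHTTPRequestHandler):
        def do_GET(self):
            query = urlparse(self.path).query or "(empty query)"
            buf = io.BytesIO()
            pdf = Canvas(buf)
            # The requested word ends up in the PDF only because it was in the query.
            pdf.drawString(100, 750, f"You asked for: {query}")
            pdf.save()
            self.send_response(200)
            self.send_header("Content-Type", "application/pdf")
            self.end_headers()
            self.wfile.write(buf.getvalue())

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), MakePDF).serve_forever()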

MSalters
Thank you for your answer!
Masi
A: 

An outline:

1. Law

"The problem comes with enforcing that law! In principal it is easy, in practice it is expensive!" source

"There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user, but having a /robots.txt can be relevant in legal cases." source

2. Practice

disallow filetype:txt

3. Theoretically Possible?

Masi