views: 293
answers: 6

Problem: to find answers and exercises for Mathematics lectures at the University of Helsinki.

Practical problems

  1. make a list of .com sites whose robots.txt contains a Disallow directive (a sketch follows this list)
  2. make a list of the sites from (1) that contain *.pdf files
  3. make a list of the sites from (2) whose PDF files contain the word "analyysi"
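
A minimal sketch for problem 1, assuming you already have a candidate list of .com domains in a hypothetical file domains.txt (there is no free way to enumerate every registered .com domain, as the answers below point out):

    # Sketch: check which candidate .com domains have a Disallow rule in robots.txt.
    # The input file "domains.txt" (one domain per line) is a hypothetical placeholder.
    import urllib.request

    def has_disallow(domain: str) -> bool:
        """Return True if the domain serves a robots.txt containing a Disallow line."""
        url = f"http://{domain}/robots.txt"
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                text = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            return False  # unreachable host, HTTP error, or malformed domain
        return any(line.strip().lower().startswith("disallow:")
                   for line in text.splitlines())

    if __name__ == "__main__":
        with open("domains.txt") as f:
            domains = [line.strip() for line in f if line.strip()]
        for domain in domains:
            if has_disallow(domain):
                print(domain)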

Suggestions for practical problems

  1. Problem 3: write a scraper that extracts the text from PDF files (see the sketch below)
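
A minimal sketch of such a scraper, assuming the candidate PDF files have already been downloaded into a hypothetical local directory pdfs/ and using the third-party pdfminer.six package (one of several options) for text extraction:

    # Sketch: check locally downloaded PDF files for the word "analyysi".
    # Assumes the third-party pdfminer.six package is installed and that the
    # PDFs sit in a hypothetical local directory "pdfs/".
    from pathlib import Path
    from pdfminer.high_level import extract_text

    def contains_word(pdf_path: Path, word: str = "analyysi") -> bool:
        """Return True if the extracted text of the PDF contains the word."""
        try:
            text = extract_text(str(pdf_path))
        except Exception:
            return False  # corrupt, encrypted, or otherwise unreadable PDF
        return word.lower() in text.lower()

    if __name__ == "__main__":
        for pdf in sorted(Path("pdfs").glob("*.pdf")):
            if contains_word(pdf):
                print(pdf)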

Questions

  1. How can you search for .com sites that are registered?
  2. How would you solve practical problems 1 and 2 with Python's defaultdict and BeautifulSoup? (A sketch of what I mean follows.)
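
To make question 2 concrete, here is a minimal sketch of the kind of defaultdict + BeautifulSoup combination it refers to; the sites list is a hypothetical placeholder for the output of problem 1, and only each site's front page is scanned for links ending in .pdf:

    # Sketch: collect PDF links per site with BeautifulSoup, grouped in a defaultdict.
    # Assumes the third-party packages requests and beautifulsoup4 are installed;
    # the "sites" list is a hypothetical placeholder for the output of problem 1.
    from collections import defaultdict
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    sites = ["http://example.com"]  # placeholder: candidate sites from problem 1
    pdf_links = defaultdict(list)   # site -> PDF URLs found on its front page

    for site in sites:
        try:
            html = requests.get(site, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable sites
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            if anchor["href"].lower().endswith(".pdf"):
                pdf_links[site].append(urljoin(site, anchor["href"]))

    for site, links in pdf_links.items():
        print(site, "->", len(links), "PDF link(s)")
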
+3  A: 

If I understand your requirements, you'd essentially have to spider every possible site in order to see which one(s) match your criteria. I don't see any faster or more efficient solution, regardless of what tools you use.

Head Geek
@Head Geek: Have you any idea how long that would take? My calculations give me a result of one month.
Masi
Depends on a lot of things, e.g. what kind of equipment you will be using. Obviously if you're Google and you have many hundreds of thousands of dedicated servers, it can be fairly quick. If you're working on just a few standard PCs... my intuition suggests years/decades/centuries, though I don't know the actual numbers involved.
David Zaslavsky
@David: Does Google have a database of websites that are blocked by robots.txt? In other words, is it possible to use Google to search a site with a strict robots.txt? I have apparently been able to search a site blocked by robots.txt with the following Google query: site:mySite.com filetype:pdf Analysis
Masi
Use "disallow filetype:txt" in your favorites search engines such as google, bing and you will get a few domains...
merkuro
+1  A: 

If I understand you correctly, then I don't see how this is possible without, as already mentioned, scanning the entire internet. You are looking for pages on the internet which are not on Google? There is no database of every site on the net that records whether or not it is indexed by a search engine...

You would literally need to index the entire web and then go through each site and check whether it is on Google.

I am also unsure whether this relates to one site or to the whole web, since your question seems to switch between the two.

Damien
@Damien: I need to find one site which blocks default web spiders. It may be possible to first make a list of sites which block default web spiders, and then search those sites. I am also unsure of the best way to find the wanted site. Feel free to edit my question if you find parts of it confusing.
Masi
There isn't a list of sites which are not in Google; you would need to find them yourself.
Damien
Well I guess Google would have a list of such sites (not their contents though, as I assume someone would have blown the whistle when seeing Google in the access logs). But indeed: Google does not provide any means to get that list, unless you get a job there.
Arjan
A: 

Do you mean that you have your lectures on a web page of your University's intranet and that you would like to be able to access this page from outside your University's intranet?

I assume that in order to access your Uni's intranet you must enter a password, and that Google does not index any of the Uni's intranet pages -- which is the nature of an intranet.

If all the above assumptions are correct, then you simply need to host your PDF files on a website outside your University's intranet. The simplest way is to start a blog (no cost involved, and very easy and quick to do) and then post your PDF files there.

Google will then index your pages and also "scrape data" from your PDFs, as you put it, which means that the text within your PDF files will be searchable.

mattRo55
@Matt: I do not mean that the lecture exercises are on the intranet. They are openly on the Internet, but blocked by robots.txt, so that Google's web spiders will not index the content of the site.
Masi
So you're looking for your own material (anywhere on the net)?
Arjan
+6  A: 

I am trying to find every web site on the internet that has a pdf-file which has the word "Analyysi"

Not an answer to your question, but: PLEASE respect the site owner's wish to NOT be indexed.

Arjan
Are the two connected: rare references in Mathematics books and their sites being blocked by robots.txt? Hopefully Steven Levitt is on SO :)
Masi
+3  A: 

Your questions are faulty.

With respect to (2), you are making the faulty assumption that you can find all PDF files on a web server. This is not possible, for multiple reasons. The first reason is that not all documents may be referenced. The second is that even if they are referenced, the reference itself may be invisible to you. Finally, there are PDF resources which are generated on the fly: they do not exist until you ask for them, and since they depend on your input, there is an infinite number of them.

Question 3 is faulty for pretty much the same reasons. In particular, the generated PDF may contain the word "analyysi" only if you used it in the query. E.g. http://example.com/makePDF.cgi?analyysi
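
A minimal sketch of such an on-the-fly generator, assuming the third-party reportlab package and a hypothetical /makePDF endpoint echoing the example URL; the word appears in the resulting PDF only because the request put it there:

    # Sketch: a web endpoint that generates a PDF on the fly, so the document
    # (and any word in it) only exists once someone requests it.
    # Assumes the third-party reportlab package; the endpoint itself is hypothetical.
    import io
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse

    from reportlab.pdfgen.canvas import Canvas

    class MakePDF(BaseHTTPRequestHandler):
        def do_GET(self):
            query = urlparse(self.path).query or "(empty query)"
            buf = io.BytesIO()
            pdf = Canvas(buf)
            # The requested word ends up in the PDF only because it was in the query.
            pdf.drawString(100, 750, f"You asked for: {query}")
            pdf.save()
            self.send_response(200)
            self.send_header("Content-Type", "application/pdf")
            self.end_headers()
            self.wfile.write(buf.getvalue())

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), MakePDF).serve_forever()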

MSalters
Thank you for your answer!
Masi
A: 

An outline:

1. Law

"The problem comes with enforcing that law! In principal it is easy, in practice it is expensive!" source

"There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user, but having a /robots.txt can be relevant in legal cases." source

2. Practice

disallow filetype:txt

3. Theoretically Possible?

Masi