ansaurus

Question

Getting all pdf files from a domain (for example *.adomain.com)

Answer 1

A:

If the links to the files have been removed, and you have no permission to list the directories, it's basically impossible to know behind what URL there is a pdf-file.

You could have a look at http://www.archive.org and look up a previous state of the page if you believe there has been links to the files in the past.

To retrieve all pdfs mentioned on the site recursively I recommend wget. From the examples at http://www.gnu.org/software/wget/manual/html_node/Advanced-Usage.html#Advanced-Usage

You want to download all the gifs from a directory on an http server. You tried ‘wget http://www.server.com/dir/*.gif’, but that didn't work because http retrieval does not support globbing. In that case, use:
     wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
More verbose, but the effect is the same. ‘-r -l1’ means to retrieve recursively (see Recursive Download), with maximum depth of 1. ‘--no-parent’ means that references to the parent directory are ignored (see Directory-Based Limits), and ‘-A.gif’ means to download only the gif files. ‘-A "*.gif"’ would have worked too.

(Simply replace .gif with .pdf!)

aioobe 2010-06-05 09:09:10

+1 Thanks!!! I 'll try that

Zack 2010-06-05 09:16:38

Can't thank you enough, works like a charm! One more thing, searched throught wget documentation couldn't find it: Is there a way to tell wget to download from all domains with url, for example, *.gov.co.uk. (without actually listing all the domains )

Zack 2010-06-05 09:35:00

How could *any* system conceivably know all of the subdomains that are allowed for a particular domain?

Gareth 2010-06-05 09:51:41

ansaurus

tags:

views:

answers:

Getting all pdf files from a domain (for example *.adomain.com)

related questions