views: 42
answers: 4

I need to download all PDFs (or any other files) available on a given website. I have a separate download module: I give it a link to a PDF file and it downloads the file. What I need is a tool that can crawl a given website and extract the hyperlinks of all the PDF files available on the site, so that I can send them to my download module one by one and download only those files which match a particular pattern. Is there any such crawler (front end or back end) available free of cost? Even a paid crawler would do, and it's fine if it extracts all the hyperlinks on the site, whether or not the links correspond to a downloadable file.

A: 

You can use Google: "site:yoursite.com filetype:pdf" and then export results to CSV with http://seoquake.com/.

Sam Dark
A: 

You could use wget in a script file; it should handle what you want.
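
Since you have your own download module, you could run wget in spider mode so it only crawls and logs URLs, then filter the log for PDF links. An untested sketch (example.com and the depth are placeholders):

    #!/bin/bash
    # Crawl the site without downloading anything (--spider) and log
    # every URL wget visits, then keep only the links ending in .pdf.
    wget --spider --recursive --level=inf --no-parent \
         http://example.com/ 2>&1 \
      | grep -oE 'https?://[^ ]+\.pdf$' \
      | sort -u > pdf-links.txt
    # Each line of pdf-links.txt can then be fed to your download module.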

TheSteve0
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. It has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
Saubhagya
Try wget with the -m (mirror) option first; that should be aggressive enough to get everything. If that works, then start playing with some of the other options. I would recommend trying this on one of your own servers (or a friend's) first. I am not talking about legality; I just know that aggressive downloading or crawling can sometimes be met with IP banning.
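Something like this, for instance (untested, using the same host as in your comment):

    # Mirror the whole site, ignoring robots.txt and keeping only PDFs.
    # -m is shorthand for -r -N -l inf --no-remove-listing.
    wget -m -e robots=off --no-parent -A.pdf http://www.server.com/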
TheSteve0
A: 

Perhaps you are looking for a program like HTTrack, which will let you set up filters to download only particular files on a site.
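For instance, something along these lines (an untested sketch; the URL and output directory are placeholders, and the filter list may need tuning):

    # Mirror into ./pdfs, rejecting everything (-*) except HTML pages
    # (which HTTrack still needs in order to follow links) and PDFs.
    httrack "http://www.example.com/" -O ./pdfs "-*" "+*.html" "+*.pdf"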

You might also have luck using wget to crawl a site. Here is a tutorial for doing just that.

ty
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. It has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
Saubhagya
Are both directories discoverable from the starting point? wget won't know about the 'results' directory unless there's a link to it somewhere reachable from server.com.
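One quick way to check (untested; substitute your real host) is to dump the start page and grep for a link into the other directory:

    # Fetch the front page to stdout and look for any href
    # that points at the results directory.
    wget -qO- http://www.server.com/ | grep -o 'href="[^"]*"' | grep results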
ty
Yes, I suppose there's some problem with the switches used with wget, because when I use WebRipper (a GUI tool), it extracts all of them.
Saubhagya
A: 

DownThemAll is a Firefox plugin that does just what you want, though I don't know whether it can be used programmatically, if that's what you are looking for.

Pat