I need to download all the PDFs (or any other files) available on a given website. I have a separate download module: I give it a link to a PDF file and it downloads the file. What I need is a tool that can crawl a given website and extract the hyperlinks of all the PDF files on the site, so that I can send them to my download module one by one and download only the files that match a particular pattern. Is there any such crawler (front end or back end) available free of cost? A paid crawler would also do, and it's fine if it extracts every hyperlink on the site, whether or not the link points to a downloadable file.
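(For example, if a crawler can just dump a plain-text list of URLs, I can do the filtering and the hand-off to my download module myself; links.txt and ./download_module below are only placeholders:)

    # filter the crawler's output for PDF links and hand each one to my download module
    # (links.txt and ./download_module are placeholders)
    grep -iE '\.pdf$' links.txt | while read -r url; do
        ./download_module "$url"
    done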
A:
You can use Google: "site:yoursite.com filetype:pdf" and then export results to CSV with http://seoquake.com/.
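Once you have trimmed the exported CSV down to one URL per line (say into a file called pdf-urls.txt - the name is just an example), you can feed it to your download module, or straight to wget:

    # download every URL listed (one per line) in pdf-urls.txt
    wget -i pdf-urls.txt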
Sam Dark
2010-03-10 18:15:52
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. For example, it has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
Saubhagya
2010-03-13 05:14:51
Try wget with the -m (mirror) option first - that should be aggressive enough to get everything. If that works, then start playing with some of the other options. I would recommend trying this on one of your own servers (or a friend's) first. I am not talking about legality - I just know that aggressive downloading or crawling can sometimes be met with IP banning.
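Something along these lines is usually a gentle enough starting point (www.server.com is a placeholder, and the wait/rate values are just a suggestion):

    # a politer mirror limited to PDFs; www.server.com is a placeholder
    wget -m -np -A pdf --wait=2 --random-wait --limit-rate=200k http://www.server.com/

The --wait, --random-wait and --limit-rate options slow the crawl down, which makes a ban less likely.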
TheSteve0
2010-03-14 04:47:09
A:
Perhaps you are looking for a program like HTTrack, which will let you set up filters to download only particular files on a site.
You might also have luck with using wget to crawl a site. Here is a tutorial for doing just that.
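From memory, an HTTrack call with a PDF filter looks roughly like this (example.com and the output directory are placeholders; double-check the filter syntax against HTTrack's documentation):

    # mirror example.com into ./example-mirror, keeping only files matched by the +*.pdf filter
    httrack "http://example.com/" -O "./example-mirror" "+*.pdf"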
ty
2010-03-10 18:17:11
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. For example, it has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
Saubhagya
2010-03-12 19:48:15
Are both directories discoverable from the starting point? wget won't know about the 'results' directory unless there's a link to it somewhere off server.com.
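If they aren't cross-linked, you can give wget both directories as starting points (the paths below are placeholders for your actual ones):

    # wget accepts several start URLs, so crawl both directories in one run
    wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com/docs/ http://www.server.com/results/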
ty
2010-03-12 20:41:03
Yes, I suppose there's some problem with the switches used with wget, because when I use WebRipper (a GUI tool), it extracts all of them.
Saubhagya
2010-03-13 05:09:53
A:
DownThemAll is a Firefox plugin that does just what you want, but I don't know if you can use it programmatically, if that's what you are looking for.
Pat
2010-03-10 18:23:08