I need to download all the PDFs (or any other files) available on a given website. I have a separate download module: I give it a link to a PDF file and it downloads the file. What I need is a tool that can crawl a given website and extract the hyperlinks of all the PDF files on the site, so that I can send them to my download module one by one and download only the files that match a particular pattern. Is there any such crawler (front end or back end) available free of cost? A paid crawler would also do, and it's fine if it extracts every hyperlink on the site, whether or not the link points to a downloadable file.
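(For example, if a crawler can just dump a plain-text list of URLs, I can do the filtering and the hand-off to my download module myself; links.txt and ./download_module below are only placeholders:)

    # filter the crawler's output for PDF links and hand each one to my download module
    # (links.txt and ./download_module are placeholders)
    grep -iE '\.pdf$' links.txt | while read -r url; do
        ./download_module "$url"
    done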
A:
You can use Google: "site:yoursite.com filetype:pdf" and then export results to CSV with http://seoquake.com/.
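Once you have trimmed the exported CSV down to one URL per line (say into a file called pdf-urls.txt - the name is just an example), you can feed it to your download module, or straight to wget:

    # download every URL listed (one per line) in pdf-urls.txt
    wget -i pdf-urls.txt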
Sam Dark
2010-03-10 18:15:52
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. For example, it has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
Saubhagya
2010-03-13 05:14:51
Try wget with the -m (mirror) option first - that should be aggressive enough to get everything. If that works, then start playing with some of the other options. I would recommend trying this on one of your own servers (or a friend's) first. I am not talking about legality - I just know that aggressive downloading or crawling can sometimes be met with IP banning.
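Something along these lines is usually a gentle enough starting point (www.server.com is a placeholder, and the wait/rate values are just a suggestion):

    # a politer mirror limited to PDFs; www.server.com is a placeholder
    wget -m -np -A pdf --wait=2 --random-wait --limit-rate=200k http://www.server.com/

The --wait, --random-wait and --limit-rate options slow the crawl down, which makes a ban less likely.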
TheSteve0
2010-03-14 04:47:09
A:
Perhaps you are looking for a program like HTTrack, which will let you set up filters to download only particular files on a site.
You might also have luck with using wget to crawl a site. Here is a tutorial for doing just that.
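From memory, an HTTrack call with a PDF filter looks roughly like this (example.com and the output directory are placeholders; double-check the filter syntax against HTTrack's documentation):

    # mirror example.com into ./example-mirror, keeping only files matched by the +*.pdf filter
    httrack "http://example.com/" -O "./example-mirror" "+*.pdf"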
ty
2010-03-10 18:17:11
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. For example, it has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
Saubhagya
2010-03-12 19:48:15
Are both directories discoverable from the starting point? wget won't know about the 'results' directory unless there's a link to it somewhere off server.com.
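If they aren't cross-linked, you can give wget both directories as starting points (the paths below are placeholders for your actual ones):

    # wget accepts several start URLs, so crawl both directories in one run
    wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com/docs/ http://www.server.com/results/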
ty
2010-03-12 20:41:03
Yes, I suppose there's some problem with the switches used with wget, because when I use WebRipper (a GUI tool), it extracts all of them.
Saubhagya
2010-03-13 05:09:53
A:
DownThemAll is a Firefox plugin that does just what you want, but I don't know if you can use it programmatically, if that's what you are looking for.
Pat
2010-03-10 18:23:08