I need to download all the PDFs (or any other files) available on a given website. I have a separate download module: I give it a link to a PDF file and it downloads the file. What I need is a tool that can crawl a given website and extract the hyperlinks of all the PDF files available on the site, so that I can send them to my download module one by one and download only those files which match a particular pattern. Is there any such crawler (front end or back end) available free of cost? Even a paid crawler would do, and it does not matter if it extracts all the hyperlinks on the site, whether or not each link corresponds to a downloadable file.
                
A:
You can use Google: "site:yoursite.com filetype:pdf" and then export the results to CSV with http://seoquake.com/.
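For example, to narrow the results to PDFs whose URL matches a particular pattern, you can add an inurl: term (inurl:report below is only an illustration, substitute your own site and pattern):

    site:yoursite.com filetype:pdf inurl:report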
                  Sam Dark
                   2010-03-10 18:15:52
                
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. For example, the site has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
                  Saubhagya
                   2010-03-13 05:14:51
Try wget with the -m (mirror) option first - that should be aggressive enough to get everything. If that works, then start playing with some of the other options. I would recommend trying this on one of your own servers (or a friend's) first. I am not talking about legality - I just know that aggressive downloading or crawling can sometimes be met with IP banning.
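Something along these lines as a rough starting point (server.com is just the placeholder from your comment, and this assumes GNU wget):

    # mirror the site but keep only the PDFs; wget still fetches HTML pages to
    # extract links from them and deletes them afterwards since they don't match -A
    wget -m -e robots=off --no-parent -A.pdf http://www.server.com/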
                  TheSteve0
                   2010-03-14 04:47:09
                
                
A:
Perhaps you are looking for a program like HTTrack, which will let you set up filters to download only particular files from a site.
You might also have luck using wget to crawl the site. Here is a tutorial for doing just that.
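If all you really need is the list of PDF links to hand to your own download module, a rough sketch with wget alone could look like this (www.server.com and the file names are placeholders, and it assumes GNU wget and grep):

    # spider the site recursively without saving pages, logging every URL wget visits
    wget --spider -r -l inf --no-parent -nv -o spider.log http://www.server.com/
    # pull the PDF links out of the log (add your own pattern to the grep if needed)
    grep -Eo 'https?://[^ ]+\.pdf' spider.log | sort -u > pdf-links.txt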
                  ty
                   2010-03-10 18:17:11
                
I used the command "wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com", but it downloaded files from only one directory of the server, whereas there are two directories containing downloadable files. For example, the site has "docs" and "results" directories at the same hierarchy level, but the command downloads files from "docs" only. Please help.
                  Saubhagya
                   2010-03-12 19:48:15
                Are both directories discoverable from the starting point?  It won't know about the 'results' directory unless there's a link to it off server.com.
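If it isn't linked anywhere under the start URL, one workaround is to give wget both directories explicitly as start URLs, reusing the flags from your comment (the paths below just follow the directory names you described):

    wget -e robots=off -r -l0 --no-parent -A.pdf http://www.server.com/docs/ http://www.server.com/results/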
                  ty
                   2010-03-12 20:41:03
                Yes, I suppose there's some problem with the switches used with wget, because when I use WebRipper (a GUI tool), it extracts all of them.
                  Saubhagya
                   2010-03-13 05:09:53
                
                
A:
            DownThemAll is a Firefox plugin that does just what you want, but I don't know if you can use it programmatically, if that's what you are looking for.
                  Pat
                   2010-03-10 18:23:08