views: 78

answers: 5

Given a URL like www.mysampleurl.com, is it possible to crawl through the site and extract links to all the PDFs that might exist?

I've gotten the impression that Python is good for this kind of thing, but is this feasible to do? How would one go about implementing something like this?

Also, assume that the site does not let you visit a directory listing like www.mysampleurl.com/files/.

A: 

Would wget or curl work? They're both Unix command-line tools; wget in particular can pull down an entire site to disk for you.
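
As a rough sketch (using www.mysampleurl.com from the question as a placeholder), a recursive wget run restricted to PDFs might look like this:

# Crawl recursively up to depth 5, stay below the start URL, and keep only
# files whose suffix matches the --accept list. wget still fetches the HTML
# pages so it can follow their links, then discards the ones that don't match.
wget --recursive --level=5 --no-parent --accept=pdf http://www.mysampleurl.com/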

Dean J
I will look into those. Thanks.
deming
man wget: http://linux.die.net/man/1/wget
extraneon
A: 

You'd be better off using an existing web crawler of some ilk, rather than trying to write something yourself from scratch. HTTrack, for example, should do what you need.

me_and
I have not dealt much with crawlers, so yes, I'm looking for pointers to what is already out there for something like this.
deming
A: 

If you can use the JavaScript console:

// Log every link on the current page whose URL points at a PDF.
for (var a = document.getElementsByTagName("a"), i = 0; i !== a.length; ++i) {
    // a[i].href is the link's fully resolved URL; match http or https.
    var m = /https?:\/\/(.+\.pdf)/.exec(a[i].href);
    if (m) {
        var pdfLink = m[1];
        console.log(pdfLink);
    }
}
Yassin
Nice, but that will work on one HTML page and not on the whole site, correct?
deming
@deming: yes, that's right
Yassin
A: 

If you are looking to use Python, you may want to take a look at Scrapy.

From their FAQ:

Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them

Given that, it sounds like it would be very possible to use this to crawl the site looking for links to PDFs.
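
As a minimal sketch (the spider name is just illustrative and the start URL is the placeholder from the question), a Scrapy spider that follows internal links and reports PDF URLs could look something like this:

import scrapy


class PdfSpider(scrapy.Spider):
    # Illustrative name; the start URL is the placeholder from the question.
    name = "pdf_spider"
    allowed_domains = ["www.mysampleurl.com"]
    start_urls = ["http://www.mysampleurl.com/"]

    def parse(self, response):
        # Look at every link on the page.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url.lower().endswith(".pdf"):
                # Report the PDF link as a scraped item.
                yield {"pdf_url": url}
            else:
                # Keep crawling pages on the same site.
                yield response.follow(url, callback=self.parse)

You can run a standalone spider file with the scrapy runspider command and export the collected items with, for example, -o pdf_links.json.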

CaseyIT
A: 

Since you want to extract all the links from your site, it's going to be a lot of data, and both speed and accuracy should be considered. How about simple point-and-click extraction of data from the web? I have been using Automation Anywhere for a long time for this type of data extraction. You could look through its details and try it to see if it helps you.

Paul Berry