views: 78

answers: 5

Given a URL like www.mysampleurl.com, is it possible to crawl through the site and extract links to all the PDFs that might exist?

I've gotten the impression that Python is good for this kind of thing, but is this feasible to do? How would one go about implementing something like this?

Also, assume that the site does not let you visit a directory listing like www.mysampleurl.com/files/.

A: 

Would wget or curl work? They're both Unix command-line tools; wget in particular can pull down an entire site to disk for you.
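
As a rough sketch (using www.mysampleurl.com from the question as a placeholder), a recursive wget run restricted to PDFs might look like this:

# Crawl recursively up to depth 5, stay below the start URL, and keep only
# files whose suffix matches the --accept list. wget still fetches the HTML
# pages so it can follow their links, then discards the ones that don't match.
wget --recursive --level=5 --no-parent --accept=pdf http://www.mysampleurl.com/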

Dean J
I will look into those. Thanks.
deming
man wget: http://linux.die.net/man/1/wget
extraneon
A: 

You'd be better off using an existing web crawler of some ilk, rather than trying to write something yourself from scratch. HTTrack, for example, should do what you need.

me_and
I have not dealt much with crawlers, so yes, I'm looking for pointers to what is already out there for something like this.
deming
A: 

If you can use the JavaScript console:

// Log every link on the current page whose URL points at a PDF.
for (var a = document.getElementsByTagName("a"), i = 0; i !== a.length; ++i) {
    // a[i].href is the link's fully resolved URL; match http or https.
    var m = /https?:\/\/(.+\.pdf)/.exec(a[i].href);
    if (m) {
        var pdfLink = m[1];
        console.log(pdfLink);
    }
}
Yassin
Nice, but that will work on one HTML page and not on the whole site, correct?
deming
@deming: yes, that's right
Yassin
A: 

If you are looking to use Python, you may want to take a look at Scrapy.

From their FAQ:

Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them

Given that, it sounds like it would be very possible to use this to crawl the site looking for links to PDFs.
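
As a minimal sketch (the spider name is just illustrative and the start URL is the placeholder from the question), a Scrapy spider that follows internal links and reports PDF URLs could look something like this:

import scrapy


class PdfSpider(scrapy.Spider):
    # Illustrative name; the start URL is the placeholder from the question.
    name = "pdf_spider"
    allowed_domains = ["www.mysampleurl.com"]
    start_urls = ["http://www.mysampleurl.com/"]

    def parse(self, response):
        # Look at every link on the page.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url.lower().endswith(".pdf"):
                # Report the PDF link as a scraped item.
                yield {"pdf_url": url}
            else:
                # Keep crawling pages on the same site.
                yield response.follow(url, callback=self.parse)

You can run a standalone spider file with the scrapy runspider command and export the collected items with, for example, -o pdf_links.json.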

CaseyIT
A: 

Since you want to extract all the links from your site, it's going to be a lot of data, and both speed and accuracy should be considered. How about simple point-and-click extraction of data from the web? I have been using Automation Anywhere for a long time for this type of data extraction. You could look through its details and try it to see if it helps you.

Paul Berry