I want to download about 200 different HTML files over HTTPS, extract the title of each page, and put the titles into a text document. How would I go about using Perl to download files over HTTPS? I searched Google but didn't find much helpful information or any examples.

+9  A: 

Take a look at HTML::HeadParser, part of the HTML::Parser distribution. It will parse an HTML header for you to extract the <title> tag contents.

For fetching HTML content, there are a huge number of CPAN modules available. One such module is LWP::Curl, which provides an LWP-style interface on top of libcurl. Search around on this site for many discussions of fetching HTML to learn more.

For downloading over https, take a look at the documentation under libwww-perl. The current "standard" way to use SSL under libwww-perl is via Crypt::SSLeay.
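As a short sketch (the HTML string and title below are made up for illustration), HTML::HeadParser can pull out the title like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Parse the <head> section of an HTML document already in memory
my $html = '<html><head><title>Example Page</title></head><body></body></html>';
my $p = HTML::HeadParser->new;
$p->parse($html);
print $p->header('Title'), "\n";  # prints "Example Page"
```

Because the parser works on the document tree rather than raw text, it doesn't care how the tags are laid out across lines.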

Ether
+1  A: 

A good place to look for information on the downloading part is the libwww-perl cookbook.

Here's some rudimentary sample code. It isn't necessarily the best way, but it should work, assuming you have the LWP module installed (available from CPAN).

#!/usr/bin/perl

use warnings;
use strict;
use LWP::Simple;

while (my $site = <STDIN>)
{
    chomp $site;        # strip the trailing newline, or the URL won't resolve
    my $doc = get $site;
    if (defined($doc))
    {
        if ( $doc =~ m/<title>(.*)<\/title>/i )
        {
           print "$1\n";
        }
    }
}

You might want to add more bells and whistles: unescaping text, handling error conditions, running requests in parallel with multiple threads, faking the user-agent as Mozilla, etc. :)
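One of those bells and whistles, faking the user-agent, is just a constructor argument to LWP::UserAgent (the agent string below is an arbitrary example, not anything a server expects):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent   => 'Mozilla/5.0 (compatible; titlegrab/0.1)',  # spoofed UA string
    timeout => 10,                                         # seconds per request
);

my $resp = $ua->get('http://example.com/');
print $resp->decoded_content if $resp->is_success;
```

The same $ua object can then be reused for all 200 fetches, which also gives you connection keep-alive for free if you pass keep_alive => 1 to the constructor.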

If you save this as titlegrab.pl and have a list of sites in sites.list (one URL per line), you can run $ cat sites.list | perl titlegrab.pl to see all the titles.

Or redirect to an output file, e.g. $ cat sites.list | perl titlegrab.pl > results.txt

Any idea why I get this returned when I use the HTTPS example from the link you gave me? 500 Can't locate object method "new" via package "LWP::Protocol::https::Socket"
Silmaril89
You probably need to manually install the Net::SSL module, which LWP depends on for HTTPS. Again, this should be easy to do via CPAN.
Don't read from STDIN; read from ARGV (or <> in this case). Also, don't use regular expressions (and broken ones at that) to parse HTML: use an HTML parser.
Shlomi Fish
I did say it was rudimentary! I ran this on an example list of URLs and it worked fine for me. I agree about the HTML parser, but in the interest of keeping things short for this simple task, the regex seemed adequate. Could you describe the brokenness in more detail?
What if the title tag and its contents are not on a single line? Your regex fails. Use HTML::HeadParser instead. Also, you forgot to mention how to add HTTPS support: install Crypt::SSLeay before you run your code. And you win the "Useless use of cat" award. :)
brian d foy
Points taken. Yes, this will only work for titles on the same line. You're right - it depends on Crypt::SSLeay, which in turn depends on Net::SSL. In my case, Net::SSL was the only missing part from a default installation. I can live with the overhead of 'cat'.
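Putting the comments' suggestions together, a revised sketch (still assuming Crypt::SSLeay is installed for the HTTPS part) reads URLs via the <> operator and uses HTML::HeadParser instead of a regex:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::HeadParser;

# Reads URLs from files named on the command line, or from STDIN, via <>
while (my $site = <>) {
    chomp $site;
    my $doc = get($site);
    next unless defined $doc;

    # HeadParser copes with <title> tags that span multiple lines
    my $p = HTML::HeadParser->new;
    $p->parse($doc);
    my $title = $p->header('Title');
    print "$title\n" if defined $title;
}
```

Usage: $ perl titlegrab.pl sites.list > results.txt (no cat needed).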
How is it "keeping things short" to write a lot of extra unnecessary code?
hobbs