I want to download about 200 different HTML files over HTTPS, extract the title of each page, and put the titles into a text document. How would I go about using Perl to download files over HTTPS? I searched Google, but didn't find much helpful info or examples.
Take a look at HTML::HeadParser, part of the HTML::Parser distribution. It will parse the <head> section of an HTML document for you and extract the <title> tag contents.
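For instance, a minimal sketch of pulling a title out with HTML::HeadParser might look like this (the $html string here is just a stand-in for whatever page source you have already fetched):

use strict;
use warnings;

use HTML::HeadParser;

# Placeholder page source for illustration only
my $html = '<html><head><title>Example Page</title></head><body></body></html>';

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# HTML::HeadParser exposes the <title> contents as the "Title" pseudo-header
my $title = $parser->header('Title');
print "$title\n";    # prints "Example Page"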
For fetching HTML content, there are a huge number of CPAN modules available. One such module is LWP::Curl, which provides an LWP-style interface on top of libcurl. Search around on this site for the many discussions of fetching HTML to learn more.
For downloading over HTTPS, take a look at the libwww-perl documentation. HTTPS support in libwww-perl comes from LWP::Protocol::https, which relies on an SSL backend such as IO::Socket::SSL or the older Crypt::SSLeay.
A good place to look for information on the downloading part is the libwww-perl cookbook.
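Putting the HTTPS part together, a minimal sketch with LWP::UserAgent might look like the following. It assumes LWP::Protocol::https (and its SSL backend) is installed, and the URL is only a placeholder:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 30);

# Fetch a single page over HTTPS; get() returns an HTTP::Response object
my $response = $ua->get('https://example.com/');

if ($response->is_success) {
    print $response->decoded_content;
}
else {
    die "Fetch failed: ", $response->status_line, "\n";
}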
Here's some rudimentary sample code. It isn't necessarily the best way, but it's one that should work, assuming you have the LWP module (available from CPAN).
#!/usr/bin/perl
use warnings;
use strict;

use LWP::Simple;

# Read URLs from STDIN, one per line, and print each page's title
while (my $site = <STDIN>)
{
    chomp $site;              # strip the trailing newline so the URL is clean
    my $doc = get $site;      # LWP::Simple::get returns undef on failure
    if (defined($doc))
    {
        if ($doc =~ m{<title>(.*?)</title>}is)
        {
            print "$1\n";
        }
    }
}
You might want to add more bells and whistles, such as unescaping HTML entities, handling error conditions, doing requests in parallel with multiple threads, faking the user-agent as Mozilla, etc. :) A sketch covering a couple of these is at the end of this answer.
If you saved this as titlegrab.pl, and you had a list of sites in sites.list (one URL per line), you could run $ cat sites.list | perl titlegrab.pl
to see all the titles.
Or redirect to an output file, e.g. $ cat sites.list | perl titlegrab.pl > results.txt
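For reference, here is one way to bolt on a couple of the bells and whistles mentioned above (a browser-like User-Agent string, basic error reporting, and entity unescaping). The module names are real; the overall structure is just a sketch, not the only way to arrange it:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::Entities;

my $ua = LWP::UserAgent->new(
    agent   => 'Mozilla/5.0',   # some sites refuse the default libwww-perl agent
    timeout => 30,
);

while (my $site = <STDIN>) {
    chomp $site;
    my $response = $ua->get($site);
    unless ($response->is_success) {
        warn "$site: ", $response->status_line, "\n";
        next;
    }
    if ($response->decoded_content =~ m{<title>(.*?)</title>}is) {
        print decode_entities($1), "\n";   # turn &amp; and friends back into plain text
    }
}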