I want to download about 200 different HTML files over HTTPS, extract the title of each page, and put the titles into a text document. How would I go about using Perl to download files over HTTPS? I searched Google but didn't find much helpful information or any examples.

+9  A: 

Take a look at HTML::HeadParser, part of the HTML::Parser distribution. It will parse an HTML header for you to extract the <title> tag contents.

For fetching HTML content, there are a huge number of CPAN modules available. One such module is LWP::Curl, which provides an LWP-style interface on top of libcurl. Search around on this site for many discussions of fetching HTML to learn more.

For downloading over https, take a look at the documentation under libwww-perl. The current "standard" way to use SSL under libwww-perl is via Crypt::SSLeay.
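As a short sketch (the HTML string and title below are made up for illustration), HTML::HeadParser can pull out the title like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Parse the <head> section of an HTML document already in memory
my $html = '<html><head><title>Example Page</title></head><body></body></html>';
my $p = HTML::HeadParser->new;
$p->parse($html);
print $p->header('Title'), "\n";  # prints "Example Page"
```

Because the parser works on the document tree rather than raw text, it doesn't care how the tags are laid out across lines.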

Ether
+1  A: 

A good place to look for information on the downloading part is the libwww-perl cookbook.

Here's some rudimentary sample code. It isn't necessarily the best way, but it should work, assuming you have the LWP module installed (available from CPAN).

#!/usr/bin/perl

use warnings;
use strict;
use LWP::Simple;

while (my $site = <STDIN>)
{
    chomp $site;        # strip the trailing newline, or the URL won't resolve
    my $doc = get $site;
    if (defined($doc))
    {
        if ( $doc =~ m/<title>(.*)<\/title>/i )
        {
           print "$1\n";
        }
    }
}

You might want to add more bells and whistles: unescaping text, handling error conditions, running requests in parallel with multiple threads, faking the user-agent as Mozilla, etc. :)
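One of those bells and whistles, faking the user-agent, is just a constructor argument to LWP::UserAgent (the agent string below is an arbitrary example, not anything a server expects):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent   => 'Mozilla/5.0 (compatible; titlegrab/0.1)',  # spoofed UA string
    timeout => 10,                                         # seconds per request
);

my $resp = $ua->get('http://example.com/');
print $resp->decoded_content if $resp->is_success;
```

The same $ua object can then be reused for all 200 fetches, which also gives you connection keep-alive for free if you pass keep_alive => 1 to the constructor.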

If you save this as titlegrab.pl and have a list of sites in sites.list (one URL per line), you can run $ cat sites.list | perl titlegrab.pl to see all the titles.

Or redirect to an output file, e.g. $ cat sites.list | perl titlegrab.pl > results.txt

Any idea why I get this returned when I use the HTTPS example from the link you gave me? 500 Can't locate object method "new" via package "LWP::Protocol::https::Socket"
Silmaril89
You probably need to manually install the Net::SSL module, which LWP depends on for HTTPS. Again, this should be easy to do via CPAN.
Don't read from STDIN; read from ARGV (or <> in this case). Also, don't use regular expressions (and broken ones at that) to parse HTML: use an HTML parser.
Shlomi Fish
I did say it was rudimentary! I ran this on an example list of URLs and it worked fine for me. I agree about the HTML parser, but in the interest of keeping things short for this simple task, the regex seemed adequate. Could you describe the brokenness in more detail?
What if the title tag and its contents are not on a single line? Your regex fails. Use HTML::HeadParser instead. Also, you forgot to mention how to add HTTPS support: install Crypt::SSLeay before you run your code. And you win the "Useless use of cat" award. :)
brian d foy
Points taken. Yes, this will only work for titles on the same line. You're right - it depends on Crypt::SSLeay, which in turn depends on Net::SSL. In my case, Net::SSL was the only missing part from a default installation. I can live with the overhead of 'cat'.
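Putting the comments' suggestions together, a revised sketch (still assuming Crypt::SSLeay is installed for the HTTPS part) reads URLs via the <> operator and uses HTML::HeadParser instead of a regex:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::HeadParser;

# Reads URLs from files named on the command line, or from STDIN, via <>
while (my $site = <>) {
    chomp $site;
    my $doc = get($site);
    next unless defined $doc;

    # HeadParser copes with <title> tags that span multiple lines
    my $p = HTML::HeadParser->new;
    $p->parse($doc);
    my $title = $p->header('Title');
    print "$title\n" if defined $title;
}
```

Usage: $ perl titlegrab.pl sites.list > results.txt (no cat needed).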
How is it "keeping things short" to write a lot of extra unnecessary code?
hobbs