I have a wget-like script which downloads a page and then retrieves all the files linked in IMG tags on that page.

Given the URL of the original page and the link extracted from the IMG tag on that page, I need to build the URL for the image file I want to retrieve. Currently I use a function I wrote:

sub build_url {
    my ( $base, $path ) = @_;

    # if the path is absolute, prepend the scheme and domain to it
    # (capture the scheme too, so it isn't dropped from the result)
    if ( $path =~ m{^/} ) {
        my ($origin) = $base =~ m{^((?:https?://)?\w+(?:\.\w+)+)};
        return "$origin$path";
    }

    my @base = split '/', $base;
    my @path = split '/', $path;

    # remove a trailing filename from the base
    pop @base if $base =~ m{[[:alnum:]]+/\w+\.\w+$};

    # count the "../" segments; the empty-list assignment forces list
    # context, so we get the number of matches rather than just a flag
    my $relcount = () = $path =~ m{\.\./}g;
    while ( $relcount-- ) {
        pop @base;
        shift @path;
    }
    return join '/', @base, @path;
}

The thing is, I'm surely not the first person to solve this problem. In fact, it's such a general problem that I assume there must be some better, more standard way of dealing with it, using either a core module or something from CPAN, although a core module would be preferable. I was thinking of File::Spec, but I wasn't sure whether it has all the functionality I would need.

+3  A: 

URI -- for building
HTML::TreeBuilder -- for parsing.
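
Resolving each IMG link against the page URL then comes down to URI's new_abs constructor, which replaces build_url entirely. A minimal sketch combining the two modules (the page URL below is a placeholder, and the HTML is assumed to be on STDIN):

use strict;
use warnings;
use URI;
use HTML::TreeBuilder;

my $page_url = 'http://example.com/gallery/index.html';   # placeholder
my $html     = do { local $/; <STDIN> };                   # page source

my $tree = HTML::TreeBuilder->new_from_content($html);

# find every IMG tag and resolve its src against the page URL
for my $img ( $tree->look_down( _tag => 'img' ) ) {
    my $src = $img->attr('src') or next;
    print URI->new_abs( $src, $page_url ), "\n";
}

$tree->delete;    # free the parse tree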

eugene y
@eugene y: Thanks, have any suggestions for doing it using only core modules?
Robert S. Barnes
@Robert: paste the code from these modules into your script :-)
eugene y
@eugene y: Ahh, the old "Use the Source Luke" comeback :-) I'll take a look.
Robert S. Barnes
A: 

It sounds like you might want something like my HTML::SimpleLinkExtor module. That's what I use for my wget-like script called webreaper.
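
A minimal sketch of using it for this task (assuming the page has been saved to page.html; the page URL is a placeholder, and the relative links are resolved with URI's new_abs):

use strict;
use warnings;
use HTML::SimpleLinkExtor;
use URI;

my $page_url = 'http://example.com/gallery/index.html';   # placeholder

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file('page.html');    # or $extor->parse($html_string)

# img() returns the src attributes of the IMG tags it saw
for my $src ( $extor->img ) {
    print URI->new_abs( $src, $page_url ), "\n";
}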

brian d foy