views: 141
answers: 3
For example, scanning the contents of an HTML page with a Perl regular expression, I want to match all file extensions but not TLDs in domain names. To do this I am assuming that all file extensions appear within double quotes.

I came up with the following, and it works; however, I can't figure out a way to exclude the TLDs in domain names. As written, it also returns "com", "net", etc.

m/"[^<>]+\.([0-9A-Za-z]*)"/g

Is it possible to negate the match when there is more than one period between the quotes with text between them? (i.e. treat foo.bar.com as a domain, but not ./ or ../)

Edit: I am using $1 to return the value captured in parentheses.
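One way to sketch the "more than one period" guard on top of the original regex (a hedged sketch: the helper name and the lookahead test are assumptions, not a tested solution):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Capture both the quoted stem and the extension, then reject
# candidates whose stem still contains a period that is not part of
# "./" or "../" -- e.g. "http://foo.bar.com", where "com" is a TLD.
sub extensions_from_html {
    my ($html) = @_;
    my @exts;
    while ( $html =~ m/"([^<>"]+)\.([0-9A-Za-z]+)"/g ) {
        my ( $stem, $ext ) = ( $1, $2 );
        next if $stem =~ m{\.(?!\.?/)};    # a dot with text after it => domain
        push @exts, $ext;
    }
    return @exts;
}

my $html = '<a href="http://foo.bar.com">x</a> <img src="../pics/logo.png">';
print "$_\n" for extensions_from_html($html);
```

The negative lookahead lets `./` and `../` through while rejecting dots that sit between name parts, as in `foo.bar.com`.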

+2  A: 

First of all, extract the names with an HTML parser of your choice. You should then have something like an array containing the names, as if produced like this:

my @names = ("http://foo.bar.net/quux",
             "boink.bak",
             "mms://three.two.one",
             "hello.jpeg");

The only way to distinguish domain names from file extensions seems to be that in "file names", there is at least one more slash between the :// part and the extension. Also, a file extension can only be the last thing in the string.

So, your regular expression would be something like this (untested):

^(?:(?:\w+://)?(?:\w+\.)+\w+/)?.*\.(\w+)$
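That rule can also be sketched in plain Perl (hedged: `extension` is a hypothetical helper built from the rule stated above, not tested code from the answer):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rule from above: a trailing ".ext" counts as a file extension only
# if the string is not a bare "scheme://host.tld" -- i.e. any "://"
# must be followed by another "/" before we accept an extension.
sub extension {
    my ($name) = @_;
    # A bare domain such as "mms://three.two.one" has no path part.
    return undef if $name =~ m{://[^/]+\z};
    return $name =~ m{\.(\w+)\z} ? $1 : undef;
}

my @names = ("http://foo.bar.net/quux",   # no extension
             "boink.bak",                 # "bak"
             "mms://three.two.one",       # bare domain, skipped
             "hello.jpeg");               # "jpeg"

for my $name (@names) {
    my $ext = extension($name);
    printf "%-26s => %s\n", $name, defined $ext ? $ext : "(none)";
}
```

Splitting the bare-domain guard from the extension test avoids the backtracking case where the optional domain group is simply skipped and `mms://three.two.one` would still match `.*\.(\w+)$`.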
Svante
Looking more into parsers and will then try this out.
Structure
No need to parse this stuff yourself. Something like HTML::SimpleLinkExtor does it in a couple lines for you.
brian d foy
A: 
#!/usr/bin/perl -w

use strict;

while (<>) {
    if (m/(?<=(?:ref=|src=|rel=))"([^<>"]+?\.([0-9A-Za-z]+?))"/g) {
       if ($1 !~ /:\/\//) {
            print $2 . "\n";
       }
    }
}

Used a positive lookbehind to get only the stuff between double quotes after one of the 'link' attributes (src=, rel=, href=). Fixed to look for "://" to recognize URLs, and to allow files with absolute paths.

@Structure : There is no proper way to protect against someone leaving off the protocol part, as it would just turn into a legitimate pathname : http://www.noo.com/afile.cfg -> www.noo.com/afile.cfg. You would need to wget (or something) all of the links to make sure they are actually there. And that's an entirely different question...

Yes, I know I should use a proper parser, but am just not feeling like it right now :P

Powertieke
This doesn't cleanly extract the strings, relies on a complete enumeration of the possible top level domains, doesn't provide even close to such an enumeration, and even if it did, it would fail when an extension would be the same as a top level domain.
Svante
As Svante pointed out, there are cases where this would fail and I would need to list out all TLDs so they did not match. Given this I am inclined to believe using a parser is a better long-term solution.
Structure
Fixed to fit Svante's comments. I also took Svante's idea of checking for the "://" part of URLs to filter them out. As far as I am concerned, everyone's point of "USE A PARSER" is now proven. Unless you really wanna write creepy regexes :)
Powertieke
My version of Perl is returning the filename for $0, so I rewrote the main regex to m/((?<=(?:ref=|src=|rel=))"[^<>\/"][^<>"]+?\.([0-9A-Za-z]+?)")/g and updated $0 to $1, and $1 to $2. Then this worked. Mileage may vary.
Structure
Poking around with this further to see how far I can push the regex method anyway. I got more positive matches when I removed [^<>\/"] from the expression; this allows matches for "../file.ext", "/file.ext" and "file.ext". However, the expression-and-if-statement method will fail on a poorly formatted link, such as one lacking a protocol (http://). You could work around that with another if statement whose expression checks for more than one period, if you really wanted. In any case, as already stated, USE A PARSER! :)
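The extra more-than-one-period check mentioned here could look like this (a sketch; `looks_like_bare_domain` is a hypothetical helper, and it is only a heuristic):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristic: a quoted value with more than one period and no slash
# ("foo.bar.com") is probably a bare domain, so its last component
# is a TLD rather than a file extension.
sub looks_like_bare_domain {
    my ($s) = @_;
    my $dots = () = $s =~ /\./g;    # count the periods
    return $dots > 1 && $s !~ m{/};
}

for my $candidate ( "foo.bar.com", "../file.ext", "/file.ext", "file.ext" ) {
    printf "%-14s => %s\n", $candidate,
        looks_like_bare_domain($candidate) ? "skip (domain)" : "keep";
}
```

A bare `www.noo.com` gets skipped, while relative paths like `../file.ext` keep their extensions thanks to the slash test.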
Structure
Poorly formatted links without protocol would just make it a bad server path :)
Powertieke
+6  A: 
#!/usr/bin/perl

use strict; use warnings;
use File::Basename;
use HTML::TokeParser::Simple;
use URI;

my $parser = HTML::TokeParser::Simple->new( \*DATA );

while ( my $tag = $parser->get_tag('a') ) {
    my $uri = URI->new( $tag->get_attr('href') );
    my $ext = ( fileparse $uri->path, qr/\.\w+\z/ )[2];
    print "$ext\n";
}

__DATA__
<p><a href="../test.png">link</a> <a
href="http://www.example.com/test.jpg">link on example.com</a>
</p>
Sinan Ünür
Taking a look at this also. Thanks!
Structure