views: 141
answers: 3
For example, scanning the contents of an HTML page with a Perl regular expression, I want to match all file extensions but not TLDs in domain names. To do this I am assuming that all file extensions appear within double quotes.

I came up with the following, and it works; however, I can't figure out a way to exclude the TLDs in domain names. As written, it also returns "com", "net", etc.

m/"[^<>]+\.([0-9A-Za-z]*)"/g

Is it possible to negate the match when there is more than one period between the quotes with text between them? (i.e. treat foo.bar.com as a domain, but not ./ or ../)

Edit: I am using $1 to return the value captured in parentheses.
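One way to sketch the "more than one period" guard on top of the original regex (a hedged sketch: the helper name and the lookahead test are assumptions, not a tested solution):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Capture both the quoted stem and the extension, then reject
# candidates whose stem still contains a period that is not part of
# "./" or "../" -- e.g. "http://foo.bar.com", where "com" is a TLD.
sub extensions_from_html {
    my ($html) = @_;
    my @exts;
    while ( $html =~ m/"([^<>"]+)\.([0-9A-Za-z]+)"/g ) {
        my ( $stem, $ext ) = ( $1, $2 );
        next if $stem =~ m{\.(?!\.?/)};    # a dot with text after it => domain
        push @exts, $ext;
    }
    return @exts;
}

my $html = '<a href="http://foo.bar.com">x</a> <img src="../pics/logo.png">';
print "$_\n" for extensions_from_html($html);
```

The negative lookahead lets `./` and `../` through while rejecting dots that sit between name parts, as in `foo.bar.com`.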

+2  A: 

First of all, extract the names with an HTML parser of your choice. You should then have something like an array containing the names, as if produced like this:

my @names = ("http://foo.bar.net/quux",
             "boink.bak",
             "mms://three.two.one",
             "hello.jpeg");

The only way to distinguish domain names from file extensions seems to be that in "file names", there is at least one more slash between the :// part and the extension. Also, a file extension can only be the last thing in the string.

So, your regular expression would be something like this (untested):

^(?:(?:\w+://)?(?:\w+\.)+\w+/)?.*\.(\w+)$
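That rule can also be sketched in plain Perl (hedged: `extension` is a hypothetical helper built from the rule stated above, not tested code from the answer):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rule from above: a trailing ".ext" counts as a file extension only
# if the string is not a bare "scheme://host.tld" -- i.e. any "://"
# must be followed by another "/" before we accept an extension.
sub extension {
    my ($name) = @_;
    # A bare domain such as "mms://three.two.one" has no path part.
    return undef if $name =~ m{://[^/]+\z};
    return $name =~ m{\.(\w+)\z} ? $1 : undef;
}

my @names = ("http://foo.bar.net/quux",   # no extension
             "boink.bak",                 # "bak"
             "mms://three.two.one",       # bare domain, skipped
             "hello.jpeg");               # "jpeg"

for my $name (@names) {
    my $ext = extension($name);
    printf "%-26s => %s\n", $name, defined $ext ? $ext : "(none)";
}
```

Splitting the bare-domain guard from the extension test avoids the backtracking case where the optional domain group is simply skipped and `mms://three.two.one` would still match `.*\.(\w+)$`.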
Svante
Looking more into parsers and will then try this out.
Structure
No need to parse this stuff yourself. Something like HTML::SimpleLinkExtor does it in a couple lines for you.
brian d foy
A: 
#!/usr/bin/perl -w

use strict;

while (<>) {
    if (m/(?<=(?:ref=|src=|rel=))"([^<>"]+?\.([0-9A-Za-z]+?))"/g) {
       if ($1 !~ /:\/\//) {
            print $2 . "\n";
       }
    }
}

Used a positive lookbehind to get only the stuff between double quotes after one of the 'link' attributes (src=, rel=, href=). Fixed to look for "://" to recognize URLs, and to allow files with absolute paths.

@Structure : There is no proper way to protect against someone leaving off the protocol part, as it would just turn into a legitimate pathname : http://www.noo.com/afile.cfg -> www.noo.com/afile.cfg. You would need to wget (or something) all of the links to make sure they are actually there. And that's an entirely different question...

Yes, I know I should use a proper parser, but am just not feeling like it right now :P

Powertieke
This doesn't cleanly extract the strings, relies on a complete enumeration of the possible top level domains, doesn't provide even close to such an enumeration, and even if it did, it would fail when an extension would be the same as a top level domain.
Svante
As Svante pointed out, there are cases where this would fail and I would need to list out all TLDs so they did not match. Given this I am inclined to believe using a parser is a better long-term solution.
Structure
Fixed to fit Svante's comments. I also took Svante's idea of checking for the "://" part of URLs to filter them out. As far as I am concerned, everyone's point of "USE A PARSER" is now proven. Unless you really wanna write creepy regexes :)
Powertieke
My version of Perl is returning the filename for $0, so I rewrote the main regex to m/((?<=(?:ref=|src=|rel=))"[^<>\/"][^<>"]+?\.([0-9A-Za-z]+?)")/g and updated $0 to $1, and $1 to $2. Then this worked. Mileage may vary.
Structure
Poking around with this further to see how far I can push the regex method anyway. I got more positive matches when I removed [^<>\/"] from the expression; this allows matches for "../file.ext", "/file.ext" and "file.ext". However, the expression-and-if-statement method will fail on a poorly formatted link, such as one lacking a protocol (http://). You could work around that with another if statement whose expression checks for more than one period, if you really wanted. In any case, as already stated, USE A PARSER! :)
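The extra more-than-one-period check mentioned here could look like this (a sketch; `looks_like_bare_domain` is a hypothetical helper, and it is only a heuristic):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristic: a quoted value with more than one period and no slash
# ("foo.bar.com") is probably a bare domain, so its last component
# is a TLD rather than a file extension.
sub looks_like_bare_domain {
    my ($s) = @_;
    my $dots = () = $s =~ /\./g;    # count the periods
    return $dots > 1 && $s !~ m{/};
}

for my $candidate ( "foo.bar.com", "../file.ext", "/file.ext", "file.ext" ) {
    printf "%-14s => %s\n", $candidate,
        looks_like_bare_domain($candidate) ? "skip (domain)" : "keep";
}
```

A bare `www.noo.com` gets skipped, while relative paths like `../file.ext` keep their extensions thanks to the slash test.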
Structure
Poorly formatted links without protocol would just make it a bad server path :)
Powertieke
+6  A: 
#!/usr/bin/perl

use strict; use warnings;
use File::Basename;
use HTML::TokeParser::Simple;
use URI;

my $parser = HTML::TokeParser::Simple->new( \*DATA );

while ( my $tag = $parser->get_tag('a') ) {
    my $uri = URI->new( $tag->get_attr('href') );
    my $ext = ( fileparse $uri->path, qr/\.\w+\z/ )[2];
    print "$ext\n";
}

__DATA__
<p><a href="../test.png">link</a> <a
href="http://www.example.com/test.jpg">link on example.com</a>
</p>
Sinan Ünür
Taking a look at this also. Thanks!
Structure