ansaurus

Question

How can I format URLs nicely in Perl?

Answer 1

+1 A:

Try the URI module from cpan.

Powertieke 2010-09-20 07:27:59

Why? It's not like it does any of the things he's asking for?

Nic Gibson 2010-09-20 11:01:16

URI::split does exactly what he asks for... i think.

Powertieke 2010-09-20 11:25:54

Sure the URI module does what he's asking for. I show you how to do it.

brian d foy 2010-09-20 23:14:28

Answer 2

A:

Part of the joy of Perl is not relying on modules :) I managed the following solution:


#!/usr/bin/perl -w

use strict;

my @url_list = ("<a href=http://www.test.com&gt;www.test.com&lt;/a&gt;",
                "<a href=http://www.example.com&gt;www.example.com&lt;/a&gt",
                "<a href=http://www.this.com&gt;www.this.com&lt;/a&gt");

my ($protocol, $domain_name);

foreach my $url (@url_list) {
    $url =~ m|(\w+)://([^/:]+)(:\d+)?/(.*)|;
    $protocol = $1;
    $domain_name = $2;
    my ($url_part, $name_part) = split(/>/, $domain_name);
    $name_part =~ s/\<//g;
    print $protocol, "://" ,$url_part, " -> ", $name_part  , "\n";
}

It's not awesome, and I ended up with a stray < in the domain name that took a substitute to remove. To answer your original question, you can combine LWP::Simple and HTML::LinkExtor to download and parse HTML docs from the web. Powerful combo.

** Disclaimer: Since Ruby and Python, my Perl sucks. Apologies to the purists for brutalizing your language.

Paul Stevens 2010-09-20 08:42:04

Part of the joy of modules is handling all of the edge cases correctly. :)

brian d foy 2010-09-20 13:16:44

@brian, I reread that "code" and flinched. Somebody please hit me with something heavy!

Paul Stevens 2010-09-20 13:40:13

What code? If you're talking about your code, I flinched too. There are a lot of errors in it. If it's the URI code, it's probably because you don't actually know Perl, as you say.

brian d foy 2010-09-20 13:55:24

I was talking about my efforts. I appreciate the fact that people will point out that I'm horribly wrong, and then post better code that I can learn from. Maybe one day I'll actually crack 'average' :)

Paul Stevens 2010-09-21 06:52:09

Answer 3

+1 A:

I'm not quite sure what you exactly want. I guess you want to strip out http:// and have a shortened url to be displayed. If it's the case you can do something like :

#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;


my @url_list = ('http://www.foo.com/','http://www.foo.com/long_path');

for my $url (@url_list) {
    (my $short = $url) =~ s!\w+://!!;
    $short =~ s!/$!!;
    $short =~ s!^(.{15}).*$!$1...!;
    say "<a href='$url'>$short</a>";
}

Output:

<a href='http://www.foo.com/'&gt;www.foo.com&lt;/a&gt;
<a href='http://www.foo.com/long_path'&gt;www.foo.com/lon...&lt;/a&gt;

M42 2010-09-20 10:06:17

This makes unwarrented assumptions about the format of the URI. Not every URI has the same parts. Try it with some of my sample data.

brian d foy 2010-09-20 13:56:34

@brian: Yes, you're right, but it works for examples given by OP. I presume he could adapt to his needs.

M42 2010-09-20 14:24:21

If by "adapt" you mean use something else, you're right. Remember, you don't code to only handle the examples you're given. You code to handle the examples people haven't considered but that you now exist.

brian d foy 2010-09-20 15:59:32

Answer 4

+4 A:

The trick is figuring out how you want to pretty-print each sort of URL, so in that case you need to tell your script what to do in each case:

use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_ );

    my $s = $uri->scheme;
    my $rest = do {
        if( $s =~ /(?:https?|ftp)/ ) {
            $uri->host . $uri->path_query
            }
        elsif( $s eq 'mailto' ) {
            $uri->path
            }
        elsif( ! $s ) {
            $uri
            }
        };

    print "$uri -> $rest\n";
    }

__END__
http://www.example.com/foo/bar.html
www.example.com/foo/bar.html
ftp://www.example.com
mailto:[email protected]
https://www.example.com/foo?a=b;c=d
http://joe:[email protected]/login

This produces:

http://www.example.com/foo/bar.html -> www.example.com/foo/bar.html
www.example.com/foo/bar.html -> www.example.com/foo/bar.html
ftp://www.example.com -> www.example.com
mailto:[email protected] -> [email protected]
https://www.example.com/foo?a=b;c=d -> www.example.com/foo?a=b;c=d
http://joe:[email protected]/login -> www.example.com/login

If you want something different for a particular URL, you just need to make a branch for it and put together the parts that you want. Notice the URI also handles schemeless URIs.

If you don't want long URI strings for your pretty printing, you might throw in something like this to cut off the string after so many characters:

substr( $rest, 20 ) = '...' if length $rest > 20;

Here's a solution with given, which is slightly cleaner, but also slightly uglier. This is the Perl 5.010 version:

use 5.010;
use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_ );

    my $r;
    given( $uri->scheme ) {
        when( /(?:https?|ftp)/  ) { $r = $uri->host . $uri->path_query }
        when( 'mailto' )          { $r = $uri->path }       
        default                   { $r = $uri }
        }


    print "$uri -> $r\n";
    }

It's uglier because I have to repeat that assignment to $r. Perl 5.14 is going to fix that though be letting given have a return value. Since that stable version isn't available yet, you have to use the experimental 5.13 track:

use 5.013004;
use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_ );

    my $r = do {
        given( $uri->scheme ) {
            when( /(?:https?|ftp)/  ) { $uri->host . $uri->path_query }
            when( 'mailto' )          { $uri->path }        
            default                   { $uri }
            }
        };

    print "$uri -> $r\n";
    }

brian d foy 2010-09-20 13:35:20

Answer 5

+5 A:

Appendix B of RFC 2396 specifies a regular expression that parses a URI reference. Adapt that a bit to get what you want:

#! /usr/bin/perl

use warnings;
use strict;

use 5.10.0;  # for defined-or (//)

my $uri = qr{
  ^
  (?:([^:/?\#]+):)?  # scheme = $1
  (?://([^/?\#]*))?  # authority = $2
  ([^?\#]*)          # path = $3
  (\?[^\#]*)?        # query = $4
  (\#.*)?            # fragment = $5
}x;

The code above uses the /x modifier

It tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code.

but we want to match literal # characters if they're present, which meant I needed to escape them with backslashes. Out of habit, I started with qr/ but had to change the delimiter because of the slashes in the pattern.

A few test cases:

my @cases = qw(
  ftp://www.foo.com.invalid/
  http://www.foo.com.invalid/
  http://www.foo.com.invalid/long_path
  http://www.foo.com.invalid/?query
  http://www.foo.com.invalid?query
  http://www.foo.com.invalid/#fragment
  http://www.foo.com.invalid#fragment
);

A bit of logic

for (@cases) {
  my $nice;
  if (my($scheme,$auth,$path,@rest) = /$uri/) {
    if ($scheme eq "http" && defined $auth) {
      if (grep defined, @rest) {
        $nice = join "" => map $_ // "" => $auth, $path, @rest;
      }
      else {
        $nice = $auth
              . ($path eq "/" ? "" : $path);
      }
    }
    else {
      $nice = $_;
    }
  }

  print "$_ → $nice\n";
}

and the output:

ftp://www.foo.com.invalid/ → ftp://www.foo.com.invalid/
http://www.foo.com.invalid/ → www.foo.com.invalid
http://www.foo.com.invalid/long_path → www.foo.com.invalid/long_path
http://www.foo.com.invalid/?query → www.foo.com.invalid/?query
http://www.foo.com.invalid?query → www.foo.com.invalid?query
http://www.foo.com.invalid/#fragment → www.foo.com.invalid/#fragment
http://www.foo.com.invalid#fragment → www.foo.com.invalid#fragment

Greg Bacon 2010-09-20 13:58:37

ansaurus

tags:

views:

answers:

How can I format URLs nicely in Perl?

related questions