ansaurus

Question

How can I extract URLs from plain text with Perl?

Answer 1

+10 A:

You want URI::Find. Once you extract the links, you should be able to handle the rest of the problem just fine.

This is answered in perlfaq9's answer to "How do I extract URLs?", by the way. There is a lot of good stuff in the perlfaq documents. :)
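As a rough illustration of the URI::Find approach, here is a minimal sketch that collects every URL the module recognizes in a string. The helper name extract_urls and the sample text are made up for this example; the callback passed to URI::Find->new receives the URI object and the original matched text, and whatever it returns replaces the match in the scanned string.

```perl
use strict;
use warnings;
use URI::Find;

# Collect every URL that URI::Find recognizes in a string.
# extract_urls is a hypothetical helper name, not part of the module.
sub extract_urls {
    my ($text) = @_;
    my @found;
    my $finder = URI::Find->new(sub {
        my ($uri, $original_text) = @_;
        push @found, "$uri";      # stringify the URI object
        return $original_text;    # leave the scanned text unchanged
    });
    $finder->find(\$text);        # find() takes a reference to the text
    return @found;
}

# Hypothetical sample input.
my $sample = 'See http://www.example.com/ and ftp://ftp.example.org/pub/file.txt for details.';
print "$_\n" for extract_urls($sample);
```

Returning the original text from the callback keeps the input intact; returning an HTML anchor instead would turn the same code into a linkifier.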

brian d foy 2010-04-02 01:56:48

the link appears to be broken

MadCoder 2010-04-02 02:01:10

I typed the wrong package name in the link, but I fixed it.

brian d foy 2010-04-02 02:07:08

Answer 2

+4 A:

Besides URI::Find, also check out the big regular expression database, Regexp::Common. Its Regexp::Common::URI module gives you something as easy as:

use Regexp::Common qw/URI/;
my ($uri) = $str =~ /$RE{URI}{-keep}/;

If you want the different pieces (hostname, query parameters, etc.) of that URI, see the documentation of Regexp::Common::URI::http for what the pattern captures when -keep is used.
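To pull out every HTTP or HTTPS URL rather than just the first one, a global match in a loop is one option. This is only a sketch: the sample string and the variable names are invented for illustration, and the -scheme flag is used to widen the HTTP pattern so that it also accepts https.

```perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

# Hypothetical sample input.
my $str = 'Docs at http://example.org/?test=one&another=2 and https://www.example.com/path here.';

# -scheme widens the HTTP pattern to https; -keep puts the whole URI in $1.
my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'}{-keep};

my @uris;
while ($str =~ /$http_re/g) {
    push @uris, $1;
}
print "$_\n" for @uris;
```

Note that this pattern still requires a scheme, so a bare www.example.com is not matched.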

gugod 2010-04-02 04:06:40

Regexp::Common is an awesome set of tools. Almost every regex you could think of already exists there. It's sad that people keep reinventing them :(

Robert P 2010-04-02 16:28:53

Answer 3

+1 A:

When I tried URI::Find::Schemeless with the following text:

Here is a URL  and one bare URL with 
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)

Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://[email protected]/test/me
How about one without a protocol www.example.com?

it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:

#!/usr/bin/perl

use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;

my $heuristic = URI::Find::Schemeless->schemeless_uri_re;

my $pattern = qr{
    $RE{URI}{HTTP}{-scheme=>'https?'} |
    $RE{URI}{FTP} |
    $heuristic
}x;

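# Paragraph mode: read blank-line-separated paragraphs from DATA.
local $/ = '';

while ( my $par = <DATA> ) {
    chomp $par;
    # Escape "<" so the input text cannot inject HTML tags.
    $par =~ s/</&lt;/g;
    # Replace every match with a link; /e evaluates linkify($1).
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}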

sub linkify {
    my ($str) = @_;
    # Prepend a scheme to bare hostnames such as www.example.com.
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}

This worked for the input shown. Of course, life is never that easy, as you can see by trying (http://example.org/(9.3)).

Sinan Ünür 2010-04-02 06:10:04

@Sinan - This was a little more complicated than I was hoping for, but ultimately it was the only solution that correctly captured links missing the 'http://' part of the URL. I'm assuming that is how many of our users will enter websites into our form. Thanks for your help!

Russell C. 2010-04-02 16:26:25

Answer 4

+1 A:

Here is some sample code that shows how to extract URLs. It reads lines from stdin, checks whether each line contains a valid HTTP URL, and prints the URL it finds.

use strict;
use warnings;

use Regexp::Common qw /URI/;

# Prompt for a line, then keep reading until end of input.
print "Enter the line: \n";
while (defined(my $line = <>))
{
        chomp ($line); # remove the trailing newline
        # With -keep, the whole matched URI is captured in $1.
        if (my ($uri) = $line =~ /$RE{URI}{HTTP}{-keep}/)
        {
                print "Contains an HTTP URI.\n";
                print "URL : $uri\n";
        }
        print "Enter the line: \n";
}

The sample output I get is as follows:

Enter the line:
http://stackoverflow.com/posts/2565350/
Contains an HTTP URI.
URL : http://stackoverflow.com/posts/2565350/
Enter the line:
this is not valid url line
Enter the line:
www.google.com
Enter the line:
http://
Enter the line:
http://www.google.com
Contains an HTTP URI.
URL : http://www.google.com

2010-04-02 06:36:12

@thillai - This doesn't seem to work for URLs starting with 'https' or those missing the 'http://' like your 'www.google.com' example above. Any ideas on how to change your suggested implementation to successfully handle those cases?

Russell C. 2010-04-02 15:12:19
