ansaurus

Question

How can I extract URLs from plain text with Perl?

Answer 1

+5 A:

Edit: Even if you don't want a canned regular expression, it may help you to look at the source of a tested module that works.

If you want to find URLs that match a certain string, you can easily do that with Regexp::Common and its URI patterns.

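#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

# Print every input line whose first HTTP URL contains the wanted string.
while (<>) {
  # With {-keep}, $1 holds the entire matched URL.
  if (m/$RE{URI}{HTTP}{-keep}/) {
    print $_ if $1 =~ m/what-you-want/;
  }
}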

Telemachus 2009-06-27 18:15:37

Answer 2

+7 A:

URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
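As a rough sketch of the find-then-filter approach, assuming URI::Find is installed from CPAN: the sample text and the homepage.com/.gif filter below are made up for illustration, not taken from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use URI::Find;    # CPAN module, not part of core Perl

# Hypothetical sample text, for illustration only.
my $text = 'Docs at http://www.example.com/index.html. '
         . 'Logo: http://www.homepage.com/img/logo.gif';

my @found;
my $finder = URI::Find->new(sub {
    my ($uri, $original_text) = @_;    # URI object, and the text as it appeared
    push @found, "$uri";               # stringify the URI object
    return $original_text;             # return value replaces the URI in $text;
                                       # returning the original leaves it unchanged
});
$finder->find(\$text);                 # find() takes a reference to the text

# Filter the collected URIs afterwards (hypothetical criterion).
my @gifs = grep { m{homepage\.com/.*\.gif\z} } @found;
print "$_\n" for @gifs;
```

Collecting everything first and filtering with grep keeps the URL-recognition rules in the module and the selection logic in your own code.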

Schwern 2009-06-27 18:29:40

Answer 3

A:

"I thought that shouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match."

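It does, but it gives you the smallest match going right. The regex engine tries the leftmost starting point first, so the match begins at the first http and extends right only as far as it needs to. A non-greedy quantifier shortens the right end of the match; it never moves the starting point.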
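A small demonstration of that behaviour, using made-up sample text (the URLs are hypothetical, not the asker's data):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample text: two URLs, only the second one ends in .gif.
my $text = 'see http://a.example/page and http://www.homepage.com/pic.gif now';

# The match starts at the leftmost "http://". The non-greedy .*? then
# grows one character at a time until "homepage.com/" can match, so it
# swallows the rest of the first URL and the text in between.
my ($lazy) = $text =~ m{(http://.*?homepage\.com/.*?\.gif)};

print "$lazy\n";
# prints: http://a.example/page and http://www.homepage.com/pic.gif
```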

Please note for the future: you don't have to escape the slashes, because you don't have to use slashes as your delimiter. You don't have to escape the colon either. Next time just do this:

m|(http://.*?homepage\.com/.*?\.gif)|

or

m#(http://.*?homepage\.com/.*?\.gif)#

or

m<(http://.*?homepage\.com/.*?\.gif)>

or one of many other delimiter characters; see the perlop documentation (Quote and Quote-like Operators).

AmbroseChapel 2009-06-28 11:35:32

OK just out of curiosity, why the downvote?

AmbroseChapel 2009-06-29 01:56:13

Answer 4

+1 A:

URLs aren't allowed to contain spaces, so instead of .*? you should use \S*?, for zero-or-more non-space characters.
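A minimal sketch of the difference, again on made-up sample text. Because \S cannot cross whitespace, a match attempt that starts at the wrong URL fails and the engine moves on to the next one:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample text with three URLs, two of them .gif files
# on homepage.com.
my $text = 'see http://a.example/page and http://www.homepage.com/pic.gif, '
         . 'plus http://homepage.com/b/logo.gif';

# In list context with /g, the match returns every captured URL.
my @gifs = $text =~ m{(http://\S*?homepage\.com/\S*?\.gif)}g;

print "$_\n" for @gifs;
# prints:
# http://www.homepage.com/pic.gif
# http://homepage.com/b/logo.gif
```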

DougWebb 2009-06-30 04:39:01
