I've seen some posts like this, but not exactly what I want to do.

How can I extract URLs from plain text, save them, and then remove them from the text?

Example:

"Hello!!, I love http://www.google.es".

I want to extract "http://www.google.es", save it in a variable, and then remove it from my text.

Finally, the text should look like this:

"Hello!!, I love".

The URL is usually the last "word" of the text, but not always.

A: 

If Perl is not a must, awk can do it:

$ cat  file
"Hello!!, I love http://www.google.es".
this is another link http://www.somewhere.com
this if ftp link ftp://www.anywhere.com the end

$ awk '{gsub(/(http|ftp):\/\/.[^" ]*/,"") }1'  file
"Hello!!, I love ".
this is another link
this if ftp link  the end

Of course, you can also adapt the regex to Perl if you like
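
A minimal sketch of such a Perl adaptation, assuming only http, https, and ftp links matter and that a URL ends at the next whitespace or double quote. The helper name strip_urls is hypothetical, and the pattern is the same simple one as above, not a standards-conforming URI matcher (a URL followed directly by a period would keep that period, for example). It also saves each URL before removing it, which is what the question asks for:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned text followed by the URLs found.
# Assumption: a URL starts with http://, https:// or ftp:// and runs up to
# the next whitespace character or double quote.
sub strip_urls {
    my ($text) = @_;
    my @urls;

    # Match optional spaces/tabs before the URL so no stray blank is left
    # behind, remember the URL itself, and replace the match with nothing.
    $text =~ s{[ \t]*((?:https?|ftp)://[^\s"]+)}{ push @urls, $1; '' }ge;

    return ($text, @urls);
}

my ($clean, @urls) = strip_urls('"Hello!!, I love http://www.google.es".');
print "$clean\n";          # prints: "Hello!!, I love".
print "$_\n" for @urls;    # prints: http://www.google.es
```

Returning the URLs as a list keeps the helper usable when a line contains more than one link.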

ghostdog74
Using a hand-rolled regex to find URIs is going to be fraught with errors. The actual standards-conforming patterns are much more complicated than what you have shown.
Ether
@Ether, that's BS, OP's requirement is simple. A regex approach is definitely ok. I don't have to download any modules for that.
ghostdog74
What requirement? He didn't say anything about limiting it to only two URI schemes. There's a lot that your regex doesn't handle. A regex may be fine, but your regex is not.
brian d foy
Dude, read his post. Most of his URLs are at the end of the text. And I am only answering based on what he gives in his question. I can infer that his requirement is either complicated or simple. I chose the latter. I don't really care if you think my regex is not enough for general purpose. I will leave it to the OP to decide, not you.
ghostdog74
@ghost: the OP likely doesn't know what is sufficient; he's hoping that we'll give him good advice. :)
Ether
The position of the URIs isn't the issue we're disputing.
brian d foy
Why not? If they are in the last position, all we need is to remove the last field. We don't even need a regex or URI::Find.
ghostdog74
+2  A: 
  • You can use URI::Find to extract URLs from an arbitrary text document.
  • Or use Regexp::Common::URI, which provides patterns for URIs.

    use strict;
    use warnings;
    use Regexp::Common qw/URI/;
    my $str = "Hello!!, I love http://www.google.es";
    my ($uri) = $str =~ /$RE{URI}{-keep}/;
    print "$uri\n"; #output: http://www.google.es
    
Nikhil Jain
Sadly Regexp::Common doesn't support everything it should.
brian d foy
@brian d foy: OK, understood, but I also suggested `URI::Find` :)
Nikhil Jain
+5  A: 

Perhaps you want URI::Find, which can find URIs in arbitrary text. The return value from the code reference you give it produces the replacement string for the URL, so you can just return the empty string if you merely want to get rid of the URIs:

use URI::Find;

my $string = do { local $/; <DATA> };

my $finder = URI::Find->new( sub { '' } );
$finder->find(\$string );

print $string;

__END__
This has a mailto:[email protected]
Go to http://www.google.com
Pay at https://paypal.com
From ftp://ftp.cpan.org download a file
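
The question also asks to save each URL before removing it. A possible sketch, assuming the CPAN module URI::Find is installed: its callback receives a URI object and the original matched text, so the text can be pushed onto an array before the empty replacement string is returned. The helper name extract_uris is hypothetical:

```perl
use strict;
use warnings;
use URI::Find;    # CPAN module, not part of core Perl

# Hypothetical helper: returns the cleaned text followed by the URIs found.
sub extract_uris {
    my ($text) = @_;
    my @found;

    my $finder = URI::Find->new(sub {
        my ($uri, $original_text) = @_;   # URI object, text as it appeared
        push @found, $original_text;      # save the URI
        return '';                        # replace it with nothing
    });
    $finder->find(\$text);                # modifies $text in place

    return ($text, @found);
}

my ($clean, @uris) = extract_uris("Go to http://www.google.com now");
print "$clean\n";
print "saved: $_\n" for @uris;
```

The whitespace that surrounded each URI stays in the text, so a follow-up substitution may be needed if the spacing matters.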
brian d foy