ansaurus

Question

regex to turn URLs into links without messing with existing links in the text

Answer 1

+2 A:

This is almost impossible to do with a single regular expression. I would instead recommend a state-machine-based approach, something like this (in pseudo-code):

state = OUTSIDE_LINK
for pos (0 .. length input)
   switch state
   case OUTSIDE_LINK
     if substring at pos matches /<a/
       state = INSIDE_LINK
     else if substring at pos matches /(www\.\S+|\S+\.com|\S+\.org)/
       substitute link
   case INSIDE_LINK
     if substring at pos matches /<\/a>/
       state = OUTSIDE_LINK
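
For readers who want something runnable, here is a minimal PHP sketch of the state machine above. It is one possible reading of the pseudo-code, not a definitive implementation: the function name `linkify_outside_anchors` is made up for illustration, and the URL pattern (anything starting with http:// or https://) is an assumption that stands in for the looser www/.com/.org pattern. It only tracks <a>...</a> elements, so a URL inside another tag's attribute would still be rewritten.

```php
<?php
// Hypothetical helper: the name and the URL pattern are illustrative assumptions.
function linkify_outside_anchors(string $input): string
{
    $out = '';
    $insideLink = false; // false = OUTSIDE_LINK, true = INSIDE_LINK
    $pos = 0;
    $len = strlen($input);

    while ($pos < $len) {
        if ($insideLink) {
            // Copy everything up to and including the closing </a> unchanged.
            if (preg_match('/<\/a>/i', $input, $m, PREG_OFFSET_CAPTURE, $pos)) {
                $end = $m[0][1] + strlen($m[0][0]);
                $out .= substr($input, $pos, $end - $pos);
                $pos = $end;
                $insideLink = false;
            } else {
                // Unclosed <a>: copy the rest of the input as-is and stop.
                $out .= substr($input, $pos);
                break;
            }
        } elseif (preg_match('/\G<a\b/i', $input, $m, 0, $pos)) {
            // \G anchors the match at $pos; the next iteration copies the link.
            $insideLink = true;
        } elseif (preg_match('/\Ghttps?:\/\/[^\s<"]+/', $input, $m, 0, $pos)) {
            // A bare URL outside any link: wrap it in an anchor.
            $url = $m[0];
            $out .= '<a href="' . $url . '">' . $url . '</a>';
            $pos += strlen($url);
        } else {
            // Ordinary character: copy it and move on.
            $out .= $input[$pos];
            $pos++;
        }
    }
    return $out;
}
```

Calling `linkify_outside_anchors('see http://foo.com and <a href="http://bar.com">bar</a>')` should wrap only the first URL and leave the existing link untouched. Trailing punctuation after a URL would be swallowed into the link, so a real implementation would want a stricter pattern.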

amarillion 2009-06-11 13:25:18

@Tomalak - apologies, I did try my best to search for similar questions beforehand, and found similar posts, but none that answered my question.

@amarillion Thanks very much, that works. Surely there must be a way to do it using negative lookbehinds? However, this answer is perfect for what I was trying to do.

Ben 2009-06-11 15:40:30

Answer 2

+1 A:

Another way of doing it (in PHP):

    $strParts = preg_split( '/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
    foreach( $strParts as $key=>$part ) {

        /*check this part isn't a tag or inside a link*/
        if( !(preg_match( '@(<[^>]+>)@', $part ) || preg_match( '@(<a[^>]+>)@', $strParts[$key - 1] )) ) {
            $strParts[$key] = preg_replace( '@((http(s)?://)?(\S+\.{1}[^\s\,\.\!]+))@', '<a href="http$3://$4">$1</a>', $strParts[$key] );
        }

    }
    $html = implode( $strParts );

Ben 2009-06-12 16:31:34

Your code raises an `Undefined offset: -1` notice on the first part, because `$key - 1` is -1 when `$key` is 0. The fix is to change `preg_match( '@(<a[^>]+>)@', $strParts[$key - 1] )` to `preg_match('@(<a[^>]+>)@', $strParts[$key ? $key - 1 : 0])`.

Justin Johnson 2009-11-13 22:11:26

Answer 3

+1 A:

Another trick is to guard all the existing links by encoding them, then replace the remaining URLs with links, and finally decode the guarded links.

$data = 'test http://foo <a href="http://link">LINK</a> test';

$data = preg_replace_callback('/(<a href=".+?<\/a>)/','guard_url',$data);

$data = preg_replace_callback('/(http:\/\/.+?)([ .\\n\\r])/','link_url',$data);

$data = preg_replace_callback('/{{([a-zA-Z0-9+\/=]+?)}}/','unguard_url',$data);

print $data;

function guard_url($arr) { return '{{'.base64_encode($arr[1]).'}}'; }
function unguard_url($arr) { return base64_decode($arr[1]); }
function link_url($arr) { return '<a href="'.$arr[1].'">'.$arr[1].'</a>'.$arr[2]; }

The code above is just a proof of concept, and doesn't handle all situations. Still, you can see that the code is pretty straightforward.

johnk 2009-09-22 05:43:19

Answer 4

+3 A:

Finally finished it:

function add_url_links($data)
{
        $data = preg_replace_callback('/(<a href=.+?<\/a>)/','guard_url',$data);

        // Link each bare URL; a URL ends at whitespace or at the end of the input.
        $data = preg_replace_callback('/(http:\/\/\S+)(\s|$)/','link_url',$data);

        // Restore the guarded links; base64 output can contain "+", "/" and "=".
        $data = preg_replace_callback('/{{([a-zA-Z0-9+\/=]+?)}}/','unguard_url',$data);

        return $data;
}

function guard_url($arr) { return '{{'.base64_encode($arr[1]).'}}'; }
function unguard_url($arr) { return base64_decode($arr[1]); }
function link_url($arr) { return guard_url(array('','<a href="'.$arr[1].'">'.$arr[1].'</a>')).$arr[2]; }

2009-09-24 16:54:31

Your solution is innovative but I feel that it could be much simpler and faster if your regex language has look-behinds - simply add `(?<!href=")` to the beginning of your conversion expression.
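
A minimal sketch of what that could look like in PHP, which supports lookbehinds through PCRE. The function name `link_bare_urls` and the exact URL pattern are assumptions made for illustration, not part of the answer above:

```php
<?php
// Hypothetical helper: the name and the URL pattern are illustrative assumptions.
function link_bare_urls(string $text): string
{
    // (?<!href=") rejects any URL that directly follows href=" ,
    // so URLs already used as a link target are left alone.
    return preg_replace(
        '/(?<!href=")(http:\/\/[^\s<"]+)/',
        '<a href="$1">$1</a>',
        $text
    );
}
```

With the sample string from the earlier answer, `link_bare_urls('test http://foo <a href="http://link">LINK</a> test')` should link `http://foo` and leave the existing anchor alone. One caveat: the lookbehind only protects the href value, so a link whose visible label is itself a URL would still have that label rewritten.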

Renesis 2010-02-24 17:20:00

Answer 5

A:

Why not strip the existing links first, so that the whole text can be treated uniformly, and then find and link every URL from scratch? Stripping sounds easy enough.

h1d 2010-01-07 15:05:13

Not all links have a URL as the label. Once you've stripped the link out of <a href="http://www.mysite.com">visit my site</a>, you'd be stuck, because the URL is gone.
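
To make the point concrete, here is a small sketch that uses PHP's built-in strip_tags() as the stripping step. That choice is an assumption, since the answer does not say how the links would be stripped:

```php
<?php
// Example input taken from the comment above.
$html = '<a href="http://www.mysite.com">visit my site</a>';

// strip_tags() removes the tag together with its href attribute.
$stripped = strip_tags($html);

// Only the label is left; http://www.mysite.com no longer appears anywhere.
echo $stripped, "\n";
```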

Ben 2010-01-12 17:05:02
