tags:

views:

103

answers:

4
+2  Q: 

Except URL regex

Sigh, regex trouble again.

I have following in $text:

[img]http://www.site.com/logo.jpg[/img]

and 

[url]http://www.site.com[/url]

I have regex expression:

$text = preg_replace("/(?<!(\[img\]|\[url\]))([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!(\[\/img\]|\[\/url\]))/","there was link",$text);

The point is to replace url only if it's not preceded by [img] or [url] and not followed by [/img] or [/url]. On the output of previous example I get:

there was link

and

there was link

Both, URL and lookbehind and lookforward regexps are working fine separately.

$text = "[img]bash.org/logo.jpg[/img]";

$text = preg_replace("/(?<!(\[img\]|\[url\]))bash.org(?!(\[\/img\]|\[\/url\]))/","there was link",$text);

echo $text leaves everything as is and gives me [img]bash.org/logo.jpg[/img] 

I suppose the problem is in combination of lookarounds and URL regex. Where's my mistake?

I WANT TO

replace http://www.google.com with "there was link", but leave as is "[url]http://www.google.com[/url]"

I'M GETTING

http://www.google.com replaced with "there was link" and [url]http://www.google.com[/url] replaced with "there was link"

HERE'S PHP CODE TO TEST

<?php

$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com"; 
         // should NOT be changed                  //should be changed    

$text = preg_replace("/(?<!\[url\])([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!\[\/url\])/","there was link",$text);

echo $text;

echo '<hr width="100%">';

$text = ":) :-) 0:) 0:-) :)) :-))";

$text = preg_replace("/(?<!0):-?\)(?!\))/","smiley",$text);

echo $text; // lookarounds work

echo '<hr width="100%">';

$text = "http://stackoverflow.com/questions/2482921/regexp-exclusion";

$text = preg_replace("/([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9]/","it's a link to stackoverflow",$text);

echo $text; // URL pattern works fine

?>
+2  A: 

Assuming I'm understanding you, you wish to replace all URLs in your $input, with the words 'link was here', unless the URL was within either the url or img bbcode tags. The reason the lookaround assertions aren't working is because those parts are actually matching against your very greedy URL pattern (which I'm fairly sure does lots of things you don't mean it to). Writing a pattern that will match any valid URL (including query string) within other text and that will also not match the tags attached to it is not necessarily the simplest of matters. Especially since your current pattern has the http:// or ftp:// as optional.

The only way you are likely to gain any success is to decide on a strict set of rules that constitute a url.

Cags
You think that URL regexp is sooooooooooooo greedy that it grabs everyting even with lookahead?
Vlad
Yes I know it is, if you run the example you gave it's not just the URL that is replaced it's the URL and the bbcode tags. The wrap around simply checks that it isn't preceded or followed by that code, which it isn't because your main regex is consuming those characters.
Cags
A: 

It is tough to fully understand your question, but it looks like you're doing reverse BBcode. So, leave it alone if it's surrounded by tags? If that is the case, then I think you will have an interesting problem on your hands because URL regexes are notoriously complex.

I think you may be making this more complex than it needs to be. Instead, I would change anything that is between the BBcode. Here's what I think needs to happen:

  1. find the string segment "[url]"
  2. capture anything that proceeds it
  3. end the capture when the string segment "[/url]" is seen

That is an easy regex:

$string = "[url]http://www.google.com[/url] <br><br> http://www.google.com"; 

$replace = "there was link";
$text = preg_replace_all($regex,$replace,$text);
echo $text;

I know this isn't exactly what you asked for (in fact, probably the exact opposite), but it would achieve the same result and be much easier.

You can probably try using negative lookaheads with this regex, but I am not sure it would give you proper results:

$regex = "#(?!\[url\])(.*)(?!\[/url\])#";

One important note: This does not sanitize user input. Make sure you do this, but I would separate the logic so it is very easy to see what you are doing and where you are doing it. I would also use a library to do this because it's easier and probably safer.

Tim
Well actually I simplified my task before posting it here. The thing is that I already have code to change url to proper html <a href...></a> tag. It works pretty fine with that url regexp. But when I added [url][/url] parser here's the problem. Previous preg_replace changes url to <a href...></a> and when it's time to parse [url][/url] it gets formatted html tag. All I wanted is to prevent of changing url if it's included in [url][/url] to allow it parse correctly later.
Vlad
A: 

Final working regexp looks like:

(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])

Example:

<?php

$text = "

[img]http://google.com/logo.jpg[/img]

[img]www.google.com/logo.jpg[/img]

[img]http://www.google.com/logo.jpg[/img]

[url]http://google.com/logo.jpg[/url]

[url]www.google.com/logo.jpg[/url]

[url]http://www.google.com/logo.jpg[/url]

www.google.com/logo.jpg

http://google.com/logo.jpg

http://www.google.com/logo.jpg

";

$text = nl2br($text);


$text = preg_replace("'(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])'i","<font color=\"#ff0000\">link</font>",$text);

echo $text;

?>

outputs:

[img]http://google.com/logo.jpg[/img]

[img]www.google.com/logo.jpg[/img]

[img]http://www.google.com/logo.jpg[/img]

[url]http://google.com/logo.jpg[/url]

[url]www.google.com/logo.jpg[/url]

[url]http://www.google.com/logo.jpg[/url]

link

link

link

The trick is to replace only links starting with ^ or \s . No other way to solve this issue wasn't found.

Vlad
A: 

Where's my mistake?

Well, the worst mistake is the lookbehind. It isn't needed, and it's making the job much harder than it needs to be. Assuming the existing tags are well formed, you needn't bother looking for the opening tag; its presence is implied by the presence of the closing tag.

EDIT: Your regex has several other problems besides the lookbehind, but it didn't seem worthwhile to try and fix it. Instead, I grabbed a regex from RegexBuddy's built-in library of useful regexes, and added the lookahead to it.

Try this regex (or see it in action on ideone):

'_\b(?>
     (?>www\.|ftp\.|(?:https?|ftp|file)://)  # scheme or subdomain
     [-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w]     # everything else
   )(?!\[/(?:img|url)\])
 _x'

Just because a problem can be described in terms of looking forward or backward, preceding or following, etc., doesn't mean you should design the regex that way. Lookbehind in particular should never be the first tool you reach for.

Alan Moore