ansaurus

Question

Except URL regex

Answer 1

+2 A:

Assuming I understand you, you wish to replace every URL in your $input with the words 'link was here', unless the URL is inside url or img BBCode tags. The lookaround assertions aren't working because the tags themselves are being matched by your very greedy URL pattern (which I'm fairly sure does lots of things you don't mean it to). Writing a pattern that matches any valid URL (including the query string) inside other text, without also matching the tags attached to it, is not a simple matter, especially since your current pattern makes the http:// or ftp:// prefix optional.

The only way you are likely to have any success is to decide on a strict set of rules for what constitutes a URL.
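
A minimal sketch of the effect described above. The pattern here is a hypothetical stand-in (any run of non-space characters containing a dot), not the asker's actual regex, but it shows how a greedy URL pattern can swallow the tags so that the lookarounds are tested outside of them:

```php
<?php
// Hypothetical, deliberately greedy "URL" pattern: a run of non-space
// characters containing a dot. It is NOT the pattern from the question.
$pattern = '#(?<!\[url\])\S+\.\S+(?!\[/url\])#';
$input   = '[url]http://www.google.com[/url]';

// \S+ starts matching at "[" and runs through "[/url]", so the tags are
// part of the match. The lookbehind is checked before "[url]" and the
// lookahead after "[/url]", where neither tag is present.
$result = preg_replace($pattern, 'link was here', $input);
echo $result, "\n";   // link was here
```

The whole tagged block is replaced, which matches the behaviour reported for the original example.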

Cags 2010-08-14 14:01:31

You think that URL regexp is so greedy that it grabs everything even with lookahead?

Vlad 2010-08-14 15:07:31

Yes, I know it is. If you run the example you gave, it's not just the URL that is replaced; it's the URL and the BBCode tags. The lookarounds simply check that the match isn't preceded or followed by those tags, which it isn't, because your main regex is consuming those characters.

Cags 2010-08-14 16:11:57

Answer 2

A:

It is tough to fully understand your question, but it looks like you're doing reverse BBcode. So, leave it alone if it's surrounded by tags? If that is the case, then I think you will have an interesting problem on your hands because URL regexes are notoriously complex.

I think you may be making this more complex than it needs to be. Instead, I would change anything that is between the BBcode. Here's what I think needs to happen:

find the string segment "[url]"
capture anything that follows it
end the capture when the string segment "[/url]" is seen

That is an easy regex:

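$string = "[url]http://www.google.com[/url] <br><br> http://www.google.com";

// Match everything from "[url]" up to the nearest "[/url]" (non-greedy).
$regex = '#\[url\](.*?)\[/url\]#i';
$replace = "there was link";
$text = preg_replace($regex, $replace, $string);
echo $text;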

I know this isn't exactly what you asked for (in fact, probably the exact opposite), but it would achieve the same result and be much easier.

You can probably try using negative lookaheads with this regex, but I am not sure it would give you proper results:

$regex = "#(?!\[url\])(.*)(?!\[/url\])#";
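
A quick check of that doubt, as a sketch only. The first lookahead needs to fail at just one position; the engine then moves one character forward, where the text no longer starts with "[url]", and (.*) swallows the rest of the line, closing tag included:

```php
<?php
$string = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
$regex  = "#(?!\[url\])(.*)(?!\[/url\])#";

// The match begins right after the first "[", so the tags and both URLs
// are consumed rather than protected.
$result = preg_replace($regex, "there was link", $string);
echo $result, "\n";
```

Neither the closing tag nor either URL survives in the result, so this pattern does not protect the tagged link.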

One important note: This does not sanitize user input. Make sure you do this, but I would separate the logic so it is very easy to see what you are doing and where you are doing it. I would also use a library to do this because it's easier and probably safer.

Tim 2010-08-14 15:42:19

Well, actually I simplified my task before posting it here. I already have code that turns a URL into a proper HTML <a href...></a> tag, and it works fine with that URL regexp. The problem appeared when I added a [url][/url] parser: the earlier preg_replace turns the URL into <a href...></a>, so by the time [url][/url] is parsed it contains an already-formatted HTML tag. All I wanted was to leave a URL unchanged when it is inside [url][/url], so that it can be parsed correctly later.
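
One possible way to sketch that requirement without lookarounds is to match whole tagged blocks and bare URLs in a single alternation, and rewrite only the bare ones in a callback. This is a swapped-in technique, not the code used in this thread; the function name linkify_outside_bbcode and the simplified URL pattern are assumptions for illustration:

```php
<?php
// Hypothetical helper: linkify bare URLs, leave [url]/[img] blocks untouched
// so that a later BBCode pass can still parse them.
function linkify_outside_bbcode(string $text): string
{
    // First alternative: a whole [url]...[/url] or [img]...[/img] block.
    // Second alternative: a bare URL (deliberately simplified pattern).
    $pattern = '~\[(url|img)\].*?\[/\1\]|\b(?:https?://|www\.)[^\s<>\[\]]+~is';

    return preg_replace_callback($pattern, function (array $m): string {
        if ($m[0][0] === '[') {
            return $m[0];   // tagged block: return it unchanged
        }
        // Bare URL: add a scheme to "www." links and wrap in an anchor.
        $href = stripos($m[0], 'www.') === 0 ? 'http://' . $m[0] : $m[0];
        return '<a href="' . htmlspecialchars($href) . '">'
             . htmlspecialchars($m[0]) . '</a>';
    }, $text);
}

echo linkify_outside_bbcode('[url]http://google.com[/url] http://google.com'), "\n";
```

Because the tagged block is consumed by the first alternative, the URL inside it is never offered to the bare-URL branch.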

Vlad 2010-08-14 16:02:12

Answer 3

A:

Final working regexp looks like:

(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])

Example:

<?php

$text = "

[img]http://google.com/logo.jpg[/img]

[img]www.google.com/logo.jpg[/img]

[img]http://www.google.com/logo.jpg[/img]

[url]http://google.com/logo.jpg[/url]

[url]www.google.com/logo.jpg[/url]

[url]http://www.google.com/logo.jpg[/url]

www.google.com/logo.jpg

http://google.com/logo.jpg

http://www.google.com/logo.jpg

";

$text = nl2br($text);


$text = preg_replace("'(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])'i","<font color=\"#ff0000\">link</font>",$text);

echo $text;

?>

Output (as rendered in a browser):

[img]http://google.com/logo.jpg[/img]

[img]www.google.com/logo.jpg[/img]

[img]http://www.google.com/logo.jpg[/img]

[url]http://google.com/logo.jpg[/url]

[url]www.google.com/logo.jpg[/url]

[url]http://www.google.com/logo.jpg[/url]

link

link

link

The trick is to replace only links that are preceded by the start of the string (^) or by whitespace (\s). No other way to solve this issue was found.
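
The effect of that anchor can be seen in isolation with a shortened pattern. This is a sketch under the assumption that a tagged URL always follows "]" directly; the pattern is a simplification, not the full regexp above:

```php
<?php
// Simplified for illustration: require ^ or whitespace before the URL,
// with no lookarounds at all.
$pattern = '~(^|\s)(?:[\w-]+://|www\.)[^\s()<>]+~m';
$input   = '[url]http://google.com/logo.jpg[/url] http://google.com/logo.jpg';

// ${1} puts back the whitespace (or empty start-of-line) that was matched.
$result = preg_replace($pattern, '${1}link', $input);
echo $result, "\n";   // [url]http://google.com/logo.jpg[/url] link
```

The tagged URL is preceded by "]", so the pattern cannot start there; only the bare URL after the space is replaced.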

Vlad 2010-08-14 17:54:14

Answer 4

A:

Where's my mistake?

Well, the worst mistake is the lookbehind. It isn't needed, and it's making the job much harder than it needs to be. Assuming the existing tags are well formed, you needn't bother looking for the opening tag; its presence is implied by the presence of the closing tag.

EDIT: Your regex has several other problems besides the lookbehind, but it didn't seem worthwhile to try and fix it. Instead, I grabbed a regex from RegexBuddy's built-in library of useful regexes, and added the lookahead to it.

Try this regex (or see it in action on ideone):

'_\b(?>
     (?>www\.|ftp\.|(?:https?|ftp|file)://)  # scheme or subdomain
     [-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w]     # everything else
   )(?!\[/(?:img|url)\])
 _x'

Just because a problem can be described in terms of looking forward or backward, preceding or following, etc., doesn't mean you should design the regex that way. Lookbehind in particular should never be the first tool you reach for.
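
A short usage sketch of the regex above with preg_replace. The sample text and the replacement word "link" are assumptions chosen to mirror the earlier example in this thread:

```php
<?php
$regex = '_\b(?>
     (?>www\.|ftp\.|(?:https?|ftp|file)://)  # scheme or subdomain
     [-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w]     # everything else
   )(?!\[/(?:img|url)\])
 _x';

$text = "[img]http://google.com/logo.jpg[/img]\n"
      . "[url]www.google.com/logo.jpg[/url]\n"
      . "www.google.com/logo.jpg\n"
      . "http://www.google.com/logo.jpg\n";

// The atomic group stops the engine from shortening a URL to get past the
// lookahead, so a URL followed by a closing tag is skipped entirely.
$result = preg_replace($regex, 'link', $text);
echo $result;
```

The two tagged lines come back unchanged and the two bare URLs become "link".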

Alan Moore 2010-08-16 04:55:35
