I'm using the following method to parse URLs:

Regex.Replace(text, @"((www\.|(http|https|ftp)\://)[.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])",
                            "<a href=\"$1\" target=\"&#95;blank\">$1</a>", RegexOptions.IgnoreCase).Replace("href=\"www.", "href=\"http://www.");

It works great, but:

  1. asdhttp://google.com will be parsed, so how can I disallow characters before "http://" / "www"?

  2. When a URL is inside a tag, I don't want it to be parsed:

[url]http://google.com[/url]

How can I do that?

+1  A: 

Use ^ before http and www, which means the string must start with www, http, https, or ftp:

^(www\.|(http|https|ftp)
Sachin Shanbhag
But then something like "google: http://google.com" won't work.
Alex
@Alex: Do you have a specific set of strings that need to be allowed? If you try to allow "google", then you will have to allow "adshttp" as well, or else hardcode google in the alternation, like http|ftp|https|google.
Sachin Shanbhag
I just have to parse URLs in a text, just like any forum does. "Hello, this is my website: http://as.com" - the URL should be parsed here. "Hihttp://as.com" - should not be parsed. So using ^ and $ is not a solution.
Alex
A: 

Add ^ at the beginning and $ at the end, so that nothing can come before http or after the URL:

Regex.Replace(text, @"^((www\.|(http|https|ftp)\://)[.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])$",
                            "<a href=\"$1\" target=\"&#95;blank\">$1</a>", RegexOptions.IgnoreCase).Replace("href=\"www.", "href=\"http://www.");
red-X
A: 

Since it seems the URL is part of a block of text, use \b for a word boundary:

Regex.Replace(text, @"\b((www\.| ... "
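As a rough illustration of the word-boundary idea, here is a minimal sketch that prepends \b to the pattern from the question and checks two inputs. The class name `UrlBoundaryDemo` and the method `ContainsUrl` are hypothetical names made up for this example; the pattern itself is otherwise copied unchanged from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlBoundaryDemo
{
    // The question's pattern with \b prepended; nothing else is changed.
    private const string Pattern =
        @"\b((www\.|(http|https|ftp)\://)[.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

    // Hypothetical helper: true if the text contains a URL that starts at a word boundary.
    public static bool ContainsUrl(string text)
    {
        return Regex.IsMatch(text, Pattern, RegexOptions.IgnoreCase);
    }
}
```

With this sketch, `UrlBoundaryDemo.ContainsUrl("asdhttp://google.com")` returns false, because there is no word boundary between "d" and "h", while `UrlBoundaryDemo.ContainsUrl("site: http://google.com")` returns true. One caveat: \b also sits between "/" and "w", so an input such as "Hihttp://www.as.com" would still match the `www.as.com` part.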

Your second question is a bit more tricky - have you considered using the same regex for both tasks?

Kobi
Looks like that's what I need. But how can I exclude the word?
Alex
"[^\b(\[url\])]" doesn't work
Alex
@Alex - I gave it some thought, and it isn't so simple. You could use `(?<!\[url\])` before the regex (a negative lookbehind), but it wouldn't work for `[url]http://www.example.com[/url]` - the regex *will* still capture `www.example.com`. As I've said, you may need to write a small parser for that, so you can parse these tokens first, and let the regex handle the rest.
Kobi
Ok, thanks. I'll try to find something about BB code parsers online.
Alex
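For reference, one way to sketch the "parse the tags first, let the regex handle the rest" idea from the comments is a single alternation: the first branch consumes a whole `[url]...[/url]` block, the second branch matches a bare URL, and a `MatchEvaluator` decides what to emit. This is only a sketch under assumptions: the class `BbAwareLinker`, the method `Linkify`, and the group names `tag` and `link` are hypothetical, only the plain `[url]...[/url]` form is handled (not `[url=...]`), and the output is not HTML-encoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BbAwareLinker
{
    // The question's URL pattern with \b prepended.
    private const string UrlPattern =
        @"\b((www\.|(http|https|ftp)\://)[.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

    // First branch: a complete [url]...[/url] block. Second branch: a bare URL.
    // The regex scans left to right, so a URL inside a tag is consumed by the
    // first branch and never reaches the second one.
    private static readonly Regex Combined = new Regex(
        @"(?<tag>\[url\].*?\[/url\])|(?<link>" + UrlPattern + ")",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: wraps bare URLs in anchors, leaves tagged URLs untouched.
    public static string Linkify(string text)
    {
        return Combined.Replace(text, m =>
        {
            if (m.Groups["tag"].Success)
                return m.Value; // keep the BB code block exactly as written

            string url = m.Value;
            // Same fix-up as the question's trailing Replace call for "www." links.
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;
            return "<a href=\"" + href + "\" target=\"_blank\">" + url + "</a>";
        });
    }
}
```

Calling `BbAwareLinker.Linkify("[url]http://google.com[/url]")` returns the input unchanged, while `BbAwareLinker.Linkify("see http://google.com now")` wraps only the URL in an anchor. A real BB code parser would also need to handle attributes, nesting, and unclosed tags, which this sketch does not attempt.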