I have a site that enables users to post messages to a forum.

At present, if a user types a web address or email address and posts it, it's treated the same as any other piece of text.

There are tools that enable the user to supply hyper-linked web and email addresses (via some bespoke tags/markup) - these are sometimes used, but not always. In addition, a bespoke 'Image' tag can also be used to reference images that are hosted on the web.

My objective is to cater both for those who use these existing tools to generate hyper-linked addresses and for those who simply type a web or email address in, automatically converting the latter to a hyper-linked address as soon as they submit their post.

I've found one or two regular expressions that convert a plain-string web or email address. However, I obviously don't want to perform any manipulation on addresses that are already being handled via the site's bespoke tagging, and that's where I'm stuck: how to EXCLUDE any web or email addresses that are already catered for via the bespoke tagging. I want to leave them as is.

Here are some examples of bespoke tagging for the variations that I need to be left alone:

[URL=www.msn.com]www.msn.com[/URL]

[URL=http://www.msn.com]http://www.msn.com[/URL]

[EMAIL=[email protected]][email protected][/EMAIL]

[IMG]www.msn.com/images/test.jpg[/IMG]

[IMG]http://www.msn.com/images/test.jpg[/IMG]

The following examples would however ideally need to be automatically converted into web & email links respectively:

www.msn.com

http://www.msn.com

[email protected]

Ideally, the 'converted' links would just have the appropriate bespoke tags applied to them as per the initial examples earlier in this post, so rather than:

<a href="..." etc.

they'd become:

[URL=http://www... etc.

Unfortunately, we have a LOT of historic data stored with this bespoke tagging throughout, so for now, we'd like to retain that rather than implementing an entirely new way of storing our users' posts.

Any help would be much appreciated.

Thanks.

+2  A: 

You'll want to add negative lookaround assertions to your regular expressions. .NET supports this fully.

http://www.regular-expressions.info/lookaround.html

Negative lookahead asserts that your pattern is not followed by something. The syntax is (?!xxx), where xxx is a pattern defining what you don't want. You could use (?!\[\/URL\]) for links, for example.

Negative lookbehind looks like (?<!xxx). Here you'll need a pattern -- something like (?<!\[URL=.*?\]) -- but you could make this more robust, if needed.
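A minimal sketch of how the two assertions might be combined in C#, assuming only http:// links need handling. The class and method names are hypothetical. The atomic group (?>...) is an addition that is not part of the suggestion above: without it, the engine can give back the last character of the URL so that the lookahead no longer sees [/URL] and the match succeeds anyway.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundSketch
{
    // Hypothetical pattern: an http:// URL that is neither inside nor directly after a [URL=...] tag.
    private static readonly Regex PlainHttpUrl = new Regex(
        @"(?<!\[URL=|\[URL=[^\]]*\])" +  // negative lookbehind: not right after "[URL=" or "[URL=...]"
        @"(?>http://[^\s\[\]]+)" +       // the URL; atomic, so backtracking cannot shorten it
        @"(?!\[/URL\])",                 // negative lookahead: not right before "[/URL]"
        RegexOptions.IgnoreCase);

    // Wraps every untagged http:// URL in the forum's [URL=...]...[/URL] markup.
    public static string TagPlainUrls(string text)
    {
        return PlainHttpUrl.Replace(text, m => "[URL=" + m.Value + "]" + m.Value + "[/URL]");
    }
}
```

With this sketch, TagPlainUrls leaves [URL=http://www.msn.com]http://www.msn.com[/URL] untouched and wraps a bare http://www.msn.com in the bespoke tags.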

Jay
So, put the negative lookbehind in front of your regex and the negative lookahead at the end, and your pattern will only be matched when it is not preceded or followed by those tags.
Jay
Thanks for the info Jay. So far I've done the following, and it ignored the [URL=] tagged string, which is great:
Regex urlregex = new Regex(@"(?<!\[URL=.*?\])(http:\/\/([\w.]+\/?)\S*)(?!\[\/URL\])", RegexOptions.IgnoreCase | RegexOptions.Compiled);
However, I also need to apply some additional code to ignore the bespoke [IMG]... tagged text AND the [EMAIL=... tagged text. How would I also incorporate those 2 into the regular expression so those bespoke tagged text items are also ignored? Thanks again for your help so far.
marcusstarnes
Frankly, I like Amethi's solution better -- much simpler. It is similar to how it works on StackOverflow. You'd need to create alternation groups -- wrap options in parens and separate by pipe character `|`, so if you wanted to match a, b, or c, you'd use `(a|b|c)`. This is going to get ugly, though, and I'm not sure a single very complex regex is going to be more performant than three passes with more simple patterns. I'd try it as three separate regexes and only try the combo if the matching is too sluggish.
Jay
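One way the alternation idea could look for the URL pass is sketched below. It assumes the three bespoke tags shown in the question ([URL=...], [EMAIL=...] and [IMG]) and only handles http:// links; the names TagAwareUrlSketch and TagPlainUrls are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAwareUrlSketch
{
    // The lookbehind holds one alternation with three options, separated by "|":
    //   1. the link sits directly after "[URL=" or "[EMAIL="
    //   2. the link sits directly after a complete "[URL=...]" or "[EMAIL=...]" opening tag
    //   3. the link sits directly after "[IMG]"
    private static readonly Regex PlainHttpUrl = new Regex(
        @"(?<!\[(?:URL|EMAIL)=|\[(?:URL|EMAIL)=[^\]]*\]|\[IMG\])" +
        @"http://[^\s\[\]]+",
        RegexOptions.IgnoreCase);

    // Wraps untagged http:// URLs in [URL=...]...[/URL]; tagged ones are left alone.
    public static string TagPlainUrls(string text)
    {
        return PlainHttpUrl.Replace(text, m => "[URL=" + m.Value + "]" + m.Value + "[/URL]");
    }
}
```

Keeping the email handling in a second, separate pattern would follow the "separate passes" suggestion and keeps each expression readable.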
Thanks for the additional feedback Jay. I've just been trying Amethi's code example that he provided and I think things are looking good with it at the moment - I'll do some further testing then report back. Thanks again.
marcusstarnes
A: 

Jay's right, though you could also use those plain-link matching regexes you have and just add \b to the start and end so it only matches links that don't have stuff around them, i.e. your forum-code tags.

\b is a word boundary: it matches where a word character meets a non-word character such as a space, period, or comma, so the link is a stand-alone word and not part of something bigger.

I did the same thing for my forum software. I parsed the forum-code first, so it built anchor tags, and then I looked for plain links on their own using such a regex and converted those.

Amethi
Hi Amethi. Thanks for the info. Would this handle addresses that appear as the first item of text in the post (i.e. no space before) or at the start of a new line? If so, how would the syntax be applied to my existing plain-link matching regex? E.g.
Regex urlregex = new Regex(@"(http:\/\/([\w.]+\/?)\S*)", RegexOptions.IgnoreCase | RegexOptions.Compiled);
Thanks.
marcusstarnes
New lines, yes; first character, I'm not so sure, but if not, you could add a space to the start of your post and then trim it off afterwards. Not a plain regex solution, but I'm not THAT good with them to know how to do it all in one funky regex. As for implementing it, it'd be something like:
new Regex(@"\b(http:\/\/([\w.]+\/?)\S*)\b", RegexOptions.IgnoreCase | RegexOptions.Compiled);
But don't quote me on that. You'd have to bung it into a regex tester (there are loads online, or there are freeware apps you can download). Oh, and unit tests: you're going to write a unit test for this, right? :)
Amethi
I just tried the example you've provided but unfortunately, it still matches my bespoke tagged text (e.g. [URL=http://www...etc]...[/URL]) :( Re: Unit-Tests, [pretend]yep[/pretend] :/
marcusstarnes
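A small check that may explain the result above: \b only needs a word character on one side and a non-word character on the other, and the "=" before "http" inside [URL=http://... satisfies that. The sketch below compares \b with the negative lookbehind (?<!\S), which requires whitespace or the start of the text before the link. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryCheck
{
    private const string Tagged = "[URL=http://www.msn.com]";

    // True: "\b" is satisfied between "=" (non-word) and "h" (word), so the tagged link still matches.
    public static bool BoundaryMatchesInsideTag()
    {
        return Regex.IsMatch(Tagged, @"\bhttp://");
    }

    // False: "(?<!\S)" fails because the character before "http" is "=", which is not whitespace.
    public static bool WhitespaceGuardMatchesInsideTag()
    {
        return Regex.IsMatch(Tagged, @"(?<!\S)http://");
    }

    // True: nothing precedes the link, so the negative lookbehind has nothing to object to.
    public static bool WhitespaceGuardMatchesAtStart()
    {
        return Regex.IsMatch("http://www.msn.com", @"(?<!\S)http://");
    }
}
```

Because (?<!\S) also passes at the very start of the text and after a line break, it would cover the "first item in the post" and "start of a new line" cases without padding the string.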
+1  A: 

Here's the method I use. I don't have access right now to the full codebase so can't see how that fits in alongside the forum-code to stop double-linking, but try it out and see if it works for you...

    /// <summary>
    /// Turns any literal URL references in a block of text into ANCHOR html elements.
    /// </summary>
    public static string ActivateLinksInText(string source)
    {
        // Pad with spaces so a URL at the very start or end still has whitespace on both sides.
        source = " " + source + " ";
        // easier to convert BR's to something more neutral for now.
        source = Regex.Replace(source, "<br>|<br />|<br/>", "\n");
        // Wrap whitespace-delimited www. / http:// runs in anchor tags.
        source = Regex.Replace(source, @"([\s])(www\..*?|http://.*?)([\s])", "$1<a href=\"$2\" target=\"_blank\">$2</a>$3");
        // Give bare www. links an explicit scheme.
        source = Regex.Replace(source, @"href=""www\.", "href=\"http://www.");
        //source = Regex.Replace(source, "\n", "<br />");
        return source.Trim();
    }
Amethi
This code is proving very useful. I've just tweaked a couple of bits to accommodate my bespoke tagging and so far it appears to be ticking all the boxes - leaving my bespoke tags untreated but handling all other URL/Email instances that I need it to.
source = Regex.Replace(source, @"([\s])(www\..*?|http://.*?)([\s])", "$1[URL=$2]$2[/URL]$3");
source = Regex.Replace(source, @"([\s])([a-zA-Z_0-9.-]+\@[a-zA-Z_0-9.-]+\.\w+)([\s])", "$1[EMAIL=$2]$2[/EMAIL]$3");
source = Regex.Replace(source, @"URL=www\.", "URL=http://www.");
I will continue to run some additional tests this morning and get back...
marcusstarnes
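For reference, a self-contained sketch of the same whitespace-delimited idea. It swaps the captured [\s] groups for the lookarounds (?<!\S) and (?!\S), which is a substitution of mine and not part of the code above; the intent is that the text needs no padding and that two links separated by a single space are both picked up, since the separating whitespace is no longer consumed by the first match. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainLinkTagger
{
    // (?<!\S): preceded by whitespace or the start of the text.
    // (?!\S):  followed by whitespace or the end of the text.
    private static readonly Regex Url = new Regex(
        @"(?<!\S)(?:www\.|http://)\S+(?!\S)", RegexOptions.IgnoreCase);

    private static readonly Regex Email = new Regex(
        @"(?<!\S)[a-zA-Z_0-9.-]+@[a-zA-Z_0-9.-]+\.\w+(?!\S)");

    // Applies the forum's bespoke [URL=...] and [EMAIL=...] tags to stand-alone addresses.
    public static string TagPlainLinks(string source)
    {
        source = Url.Replace(source, m =>
        {
            // Bare www. links get an explicit scheme in the tag attribute only.
            bool bare = m.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase);
            string href = bare ? "http://" + m.Value : m.Value;
            return "[URL=" + href + "]" + m.Value + "[/URL]";
        });
        return Email.Replace(source, m => "[EMAIL=" + m.Value + "]" + m.Value + "[/EMAIL]");
    }
}
```

As with the original, an address followed directly by punctuation (for example a trailing comma) is not treated as stand-alone, so it is either left alone or tagged together with the punctuation.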
@Amethi: This seems to be working nicely - I've been doing numerous tests and have been unable to break things (so far) so will mark this as the accepted answer. Thanks again for your help!
marcusstarnes
Glad it helped! Regexes are one of those things that I, at least, learn and then forget by the next time I need to do something with them.
Amethi
A: 

The regex you are looking for is (?<!\[EMAIL=)(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)(?!\[\/EMAIL\]), used with RegexOptions.IgnoreCase because the character classes only list upper-case letters. At least, this is what you need for the email tag. Your replacement would simply be [EMAIL=$1]$1[/EMAIL]. For the others, replace the center group and the EMAIL tags with whatever is appropriate.

Test Cases:

[EMAIL=[email protected]][email protected][/EMAIL] : FALSE
[email protected] : TRUE

Evaluated under .NET Regex, as per your tag.
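A sketch of how the email pattern might be wired up and checked in C#, written with the square brackets escaped so that [EMAIL= and [/EMAIL] are matched literally. The address someone@example.com and the names EmailTagCheck and TagPlainEmails are placeholders of mine, not taken from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailTagCheck
{
    private static readonly Regex PlainEmail = new Regex(
        @"(?<!\[EMAIL=)" +                                   // not directly after "[EMAIL="
        @"(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)" +    // group 1: the address
        @"(?!\[/EMAIL\])",                                   // not directly before "[/EMAIL]"
        RegexOptions.IgnoreCase);

    // Wraps untagged addresses in [EMAIL=...]...[/EMAIL] using group 1.
    public static string TagPlainEmails(string text)
    {
        return PlainEmail.Replace(text, "[EMAIL=$1]$1[/EMAIL]");
    }
}
```

One caveat with this sketch: \b also holds after a dot inside the local part, so an already tagged address such as [EMAIL=john.doe@example.com] can still be matched from "doe" onwards. A stricter lookbehind, or the whitespace-delimited approach from the accepted answer, avoids that.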

DonaldRay