I receive a block of code from the database which occasionally contains URLs, e.g. http://site.tld/lorem.ipsum/whatever. Now I want to turn each of these into a nice clickable link for the user, with a helper method, such as:

<a href="http://site.tld/lorem.ipsum/whatever">http://site.tld/lorem.ipsum/whatever</a>

Of course, anyone can do this; [^\s]+ does the trick. But the obvious problem is that if I have a dot (.), for example, right after the URL, I don't want it to be included in the link. So we need to stop the URL at certain characters, but we can't simply exclude those characters from the match, since the dot I mentioned earlier is a "URL stopper" that can also appear inside the URL. My first guess was this:

(http\:\/\/[^\s]+)(\,|\.|\;|\:)?

which would be replaced as

<a href="$1">$1</a>$2

But it does not work: since the second group is optional and the first one is greedy, those characters end up in the first group, where anything except whitespace is allowed.
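To make the failure concrete, here is a minimal sketch in Python, whose `re` module treats these constructs the same way. The pattern is the first attempt quoted above; the sample sentence and the variable names are only assumptions for illustration.

```python
import re

# First attempt from the question: greedy URL group, optional trailing punctuation.
FIRST_ATTEMPT = re.compile(r"(http\:\/\/[^\s]+)(\,|\.|\;|\:)?")

# Hypothetical sample text with a full stop right after the URL.
text = "See http://site.tld/lorem.ipsum/whatever. Next sentence."
match = FIRST_ATTEMPT.search(text)

# [^\s]+ is greedy and a dot is not whitespace, so group 1 swallows the dot.
url = match.group(1)
# Nothing is left for the optional group, so it does not participate (None).
trailing = match.group(2)

print(repr(url), repr(trailing))
```

The first group ends with the dot and the second group stays empty, which is exactly the behaviour described above.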

I really appreciate your help, but honestly, I don't want a gigantic rule found on the internet that merely seems to work at the moment. I'm sure there's a cool way to do this. I have a decent understanding of regular expressions, but this scenario is something I haven't run into before. Or maybe I'm missing something; after all, it is past 3 AM.

Thanks!

Edit:

@Chirael cleared it up for me, but here is my final solution:

(http\:\/\/[^\s]+?)(\,|\.|\;|\:)?(\s|$)
  1. I'm escaping the slashes because I'm using PHP
  2. I added more characters as "URL stoppers" in the second group
  3. Since the first group becomes non-greedy and the 2nd one is optional, if the 3rd one isn't specified the link will only contain the first character after "http://". There was also a problem when the URL was the last thing in the text, so the 3rd group can now be either a whitespace character or the end of the text.
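The three points above can be checked with a short sketch, again in Python. The pattern mirrors the final solution with the stoppers written as a character class; the names `LINK_RE`, `NO_ANCHOR_RE` and `linkify_lazy` are hypothetical, as are the sample strings.

```python
import re

# Lazy URL group, optional stopper, then whitespace or end of text.
LINK_RE = re.compile(r"(http://[^\s]+?)([,.;:])?(\s|$)")

# Same pattern without the third group, to show why it is needed.
NO_ANCHOR_RE = re.compile(r"(http://[^\s]+?)([,.;:])?")


def linkify_lazy(text: str) -> str:
    """Wrap each http:// URL in an anchor tag, keeping trailing stoppers outside."""
    def repl(m: re.Match) -> str:
        url = m.group(1)
        stopper = m.group(2) or ""  # the optional group may not have matched
        end = m.group(3)            # whitespace, or "" at the end of the text
        return f'<a href="{url}">{url}</a>{stopper}{end}'
    return LINK_RE.sub(repl, text)


mid_text = linkify_lazy("Go to http://site.tld/lorem.ipsum/whatever. Done")
end_text = linkify_lazy("Go to http://site.tld/a;")
too_short = NO_ANCHOR_RE.search("http://site.tld/x").group(1)

print(mid_text)
print(end_text)
print(too_short)
```

Without the third group the lazy quantifier is satisfied after a single character, which is why the link would stop right after "http://".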
A: 

You can try this:

Regex:

(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)

Replacement:

<a href="$1">$1</a>
jerjer
Thanks. I'm using PHP, where the rule is wrapped in slashes, e.g. '/(.*)/'. So your rule currently raises an error (Unknown modifier '/'). I tried escaping all the slashes in the rule, but then the result was just like mine: the dots were included in the URL. However, it did correctly exclude a comma from one of the URLs.
treznik
You're right there, skidding; a "-" character may break this rule as well.
jerjer
+2  A: 

skidding, add a ? after the [^\s]+ to make it non-greedy, then follow it with an "optional" period using ?. I used the following sample text in a file:

Lorem I receive a block of code from db which occasionally contains
urls, e.g, http://site.tld/lorem.ipsum/whatever and
http://site.tld/lorem.ipsum/whatevertwo. Now I want to turn this into
nice clickable link for the user, with a helper method. Such as.

and then ran the following code on the command line, and it seems to satisfy your requirements:

perl -pe 's#(http://[^\s]+?)(\.?)(\s)#<a href="$1">$1</a>$2$3#g' foo.txt

... resulting in:

Lorem I receive a block of code from db which occasionally contains
urls, e.g, <a href="http://site.tld/lorem.ipsum/whatever">http://site.tld/lorem.ipsum/whatever</a> and
<a href="http://site.tld/lorem.ipsum/whatevertwo">http://site.tld/lorem.ipsum/whatevertwo</a>. Now I want to turn this into
nice clickable link for the user, with a helper method. Such as.

Does that work?

Chirael
Brilliant! I knew there must be a concept of this kind. And when you said "to make it non-greedy" my heart grew. That was exactly what I was looking for. I remember I used this before but I was less inspired now I think. Do you know where I could find something to read about this idea to make sure I'm 100% aware of how it works? Thanks again!
treznik
Awesome - if this answered your question, do you mind clicking the check mark to "accept" it? I'm new on the site and have been suckered into this whole "reputation score" thing ;) (thanks :)
Chirael
Oh, and as for the idea of greedy or non-greedy, I don't have a canonical source at hand, since I learned about regular expressions back in the mid 1990s when Perl was THE language and CGI was THE thing to do (back in Perl 4's heyday). So the only thing I can recommend is "man perlre", though I believe O'Reilly has a book on regexps that might be worth browsing.
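For readers who want to see the greedy versus non-greedy difference in isolation, here is a small sketch in Python (its `re` module uses the same `+` and `+?` quantifiers). The sample string and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample: a URL containing dots, followed by a sentence-ending dot.
text = "http://site.tld/page.html. More text"

# Greedy: \S+ grabs as much as it can, then backs off just enough
# for the final \. to match, so the match ends at the LAST possible dot.
greedy = re.search(r"http://\S+\.", text).group(0)

# Lazy: \S+? grabs as little as it can and only grows when the rest
# of the pattern fails, so the match ends at the FIRST possible dot.
lazy = re.search(r"http://\S+?\.", text).group(0)

print(greedy)
print(lazy)
```

The same input gives the longest match with `+` and the shortest with `+?`; everything after the quantifier decides where a lazy match is allowed to stop.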
Chirael
Heh, thanks. Take a look at my final notes, I wrapped it up and it works perfectly for my needs now.
treznik
+1  A: 

You can also try a different approach: instead of listing what you don't want included at the end of your URL, you can specify what's acceptable as last character. In this example:

$str = preg_replace('#(http://\S+[a-z0-9/])#i', '<a href="\1">\1</a>', $str);

I'm asking for a sequence of non-spaces and an alphanumeric character (plus slash) at the end (that's usually how valid URLs end).

A couple of notes also:

  • in PHP (as in Perl) you can choose your pattern delimiters, / / is just conventional but you can pick (almost) whatever character you like: picking the right delimiter avoids a lot of escaping
  • alternation of single characters is better written as a character class: [,.;:] is much easier to read than (\,|\.|\;|\:) which also includes unnecessary escaping (only the dot needs it)
  • learn what needs to be escaped and what not, filling your pattern with backslashes will make it unreadable
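As a rough illustration of the two main ideas in this answer, here is a sketch in Python. It assumes a case-insensitive match so that "alphanumeric" also covers uppercase letters; the names `URL_RE`, `linkify_by_last_char`, `STOPPERS_ALT` and `STOPPERS_CLASS` are hypothetical, and the sample text is made up.

```python
import re

# Greedy run of non-spaces that must END in a letter, digit or slash.
# Backtracking drops trailing punctuation until that last character fits.
URL_RE = re.compile(r"(http://\S+[a-z0-9/])", re.IGNORECASE)


def linkify_by_last_char(text: str) -> str:
    """Wrap URLs in anchor tags, leaving trailing punctuation outside."""
    return URL_RE.sub(r'<a href="\1">\1</a>', text)


sample = "Visit http://site.tld/lorem.ipsum/whatever, or http://site.tld/dir/."
linked = linkify_by_last_char(sample)
print(linked)

# Alternation of single characters versus the equivalent character class.
STOPPERS_ALT = re.compile(r"(\,|\.|\;|\:)")
STOPPERS_CLASS = re.compile(r"[,.;:]")

# Both forms accept exactly the same single characters.
same = all(
    bool(STOPPERS_ALT.fullmatch(c)) == bool(STOPPERS_CLASS.fullmatch(c))
    for c in ",.;:a /"
)
print(same)
```

Python has no pattern delimiters, so the delimiter advice applies to PHP and Perl only; the character class and the last-character trick carry over unchanged.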
kemp
I'll have to revisit this when I have more time, but your approach seems perfect, and so obvious that I can't believe it hadn't crossed my mind. Also, about the pattern delimiters, how exactly can I pick one? Does just writing it as the first character automatically make it the delimiter? I guess I never mastered when to use brackets (besides when needing a capture group or a specific character class, like a-z, 0-9, etc.), and which type to use. You're right about escaping, it looks ugly, but I noticed it sometimes depends on the language, so I tend to play it safe. Thanks!
treznik
Yes, the first character in the pattern becomes the delimiter, and you have to match it at the end. You can also use all kinds of brackets; in that case you match them "naturally" instead of repeating the first character: `(..pattern..)`, `{..pattern..}` and so on.
kemp