ansaurus

Question

content URLs regexp

Answer 1

A:

You can try this:

Regex:

(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)

Replacement:

<a href="$1">$1</a>
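A minimal Python sketch of how this pattern and replacement behave, assuming the scheme is meant to be `https?` (so both `http` and `https` match) and using Python's `re.sub` in place of whatever replace function you use. The `linkify` name and the sample strings are made up for illustration. It suggests the pattern stops before a comma but still swallows a period that ends a sentence:

```python
import re

# Pattern from the answer, written with `https?` (an assumption about the intent).
URL_RE = re.compile(r'(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)')

def linkify(text):
    # \1 is the whole URL, the same thing $1 refers to in the replacement above.
    return URL_RE.sub(r'<a href="\1">\1</a>', text)

# The comma is not in the path character class, so it stays outside the link.
print(linkify("see http://site.tld/lorem.ipsum/whatever, then"))
# The dot is in the path character class, so a sentence-ending period is linked too.
print(linkify("see http://site.tld/lorem.ipsum/whatevertwo. Now"))
```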

jerjer 2009-11-25 01:40:21

Thanks. I'm using PHP, where the rule is wrapped in slashes, e.g. '/(.*)/', so your rule currently throws an error (Unknown modifier '/'). I tried escaping all the slashes inside the rule, but then the result was just like mine: the dots were included in the URL. However, it did correctly exclude a comma from one of the URLs.

treznik 2009-11-25 01:51:22

You're right there, skidding. A "-" character may break this rule as well.

jerjer 2009-11-25 09:47:06

Answer 2

+2 A:

skidding, add a ? after the [^\s]+ to make it non-greedy, and then match an "optional" period after it with \.? in its own group. I used the following sample text in a file:

Lorem I receive a block of code from db which occasionally contains
urls, e.g, http://site.tld/lorem.ipsum/whatever and
http://site.tld/lorem.ipsum/whatevertwo. Now I want to turn this into
nice clickable link for the user, with a helper method. Such as.

and then ran the following code on the command line, and it seems to satisfy your requirements:

perl -pe 's#(http://[^\s]+?)(\.?)(\s)#<a href="$1">$1</a>$2$3#g' foo.txt

... resulting in:

Lorem I receive a block of code from db which occasionally contains
urls, e.g, <a href="http://site.tld/lorem.ipsum/whatever">http://site.tld/lorem.ipsum/whatever</a> and
<a href="http://site.tld/lorem.ipsum/whatevertwo">http://site.tld/lorem.ipsum/whatevertwo</a>. Now I want to turn this into
nice clickable link for the user, with a helper method. Such as.
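To see what the non-greedy ? changes, here is a small Python sketch that runs the same pattern both ways. It assumes Python's `re.sub` behaves like the Perl substitution for this pattern; the variable names and the shortened sample text are only for illustration.

```python
import re

text = (
    "urls, e.g, http://site.tld/lorem.ipsum/whatever and\n"
    "http://site.tld/lorem.ipsum/whatevertwo. Now I want\n"
)
LINK = r'<a href="\1">\1</a>\2\3'

# Greedy: [^\s]+ runs to the end of the non-space run, so the period
# ends up inside group 1 and (\.?) matches nothing.
greedy = re.sub(r'(http://[^\s]+)(\.?)(\s)', LINK, text)

# Non-greedy: [^\s]+? stops as early as the rest of the pattern allows,
# which leaves a trailing period for (\.?) to pick up.
lazy = re.sub(r'(http://[^\s]+?)(\.?)(\s)', LINK, text)

print(greedy)
print(lazy)
```

Note that both versions require a whitespace character after the URL, so a URL at the very end of the input with no trailing newline would not be matched.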

Does that work?

Chirael 2009-11-25 01:48:37

Brilliant! I knew there must be a concept of this kind. And when you said "to make it non-greedy" my heart grew. That was exactly what I was looking for. I remember I used this before but I was less inspired now I think. Do you know where I could find something to read about this idea to make sure I'm 100% aware of how it works? Thanks again!

treznik 2009-11-25 01:57:57

Awesome - if this answered your question, do you mind clicking the check mark to "accept" it? I'm new on the site and have been suckered into this whole "reputation score" thing ;) (thanks :)

Chirael 2009-11-25 01:59:00

Oh and as for the idea of greedy or non-greedy, I don't have a canonical source at hand since I learned about regular expressions back in the mid 1990s when Perl was THE language and CGI was THE thing to do (back in Perl 4's heyday). So the only thing I could recommend is "man perlre", though I believe O'Reilly has a book on regexps that might be worth browsing.

Chirael 2009-11-25 02:00:13

Heh, thanks. Take a look at my final notes, I wrapped it up and it works perfectly for my needs now.

treznik 2009-11-25 02:06:10

Answer 3

+1 A:

You can also try a different approach: instead of listing what you don't want included at the end of your URL, you can specify what's acceptable as last character. In this example:

$str = preg_replace('#(http://\S+[a-z0-9/])#', '<a href="\1">\1</a>', $str);

I'm asking for a sequence of non-space characters that ends with a lowercase letter, a digit, or a slash (that's usually how valid URLs end).
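The same idea as a small Python sketch, assuming `re.sub` stands in for `preg_replace` (the `linkify` name and the sample strings are hypothetical):

```python
import re

def linkify(text):
    # Greedy \S+ overshoots, then backtracks until the final character
    # is a lowercase letter, a digit, or a slash.
    return re.sub(r'(http://\S+[a-z0-9/])', r'<a href="\1">\1</a>', text)

# The sentence-ending period is left outside the link.
print(linkify("http://site.tld/lorem.ipsum/whatevertwo. Now"))
# Closing parenthesis and comma are left outside as well.
print(linkify("(see http://site.tld/page/), then"))
```

Because the final class only lists lowercase letters, a URL that ends in uppercase letters would be cut short; case-insensitive matching (the `i` modifier in PHP, `re.IGNORECASE` in Python) should avoid that.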

A couple of notes also:

In PHP (as in Perl) you can choose your pattern delimiters. / is just the conventional choice, but you can pick (almost) any character you like; picking the right delimiter avoids a lot of escaping.
An alternation of single characters is better written as a character class: [,.;:] is much easier to read than (\,|\.|\;|\:), which also includes unnecessary escaping (only the dot needs it).
Learn what needs to be escaped and what doesn't; filling your pattern with backslashes will make it unreadable.
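As a quick check of the second note, this Python sketch compares the two spellings on a made-up sample string. It assumes Python's `re` treats the escaped punctuation the same way PCRE does, which holds for these four characters.

```python
import re

sample = "one, two. three; four: five"

# Alternation with redundant escapes, as in the second note.
alternation = re.compile(r'(\,|\.|\;|\:)')
# The equivalent character class; no escaping is needed inside [...].
char_class = re.compile(r'[,.;:]')

found_alt = [m.group(0) for m in alternation.finditer(sample)]
found_cls = [m.group(0) for m in char_class.finditer(sample)]

# Both patterns find the same punctuation marks in the same order.
print(found_alt == found_cls, found_cls)
```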

kemp 2009-11-27 21:48:24

I'll have to revisit this when I have more time, but your approach seems perfect, and so obvious that I can't believe it hadn't crossed my mind. Also, about the pattern delimiters, how exactly can I pick one? Does just writing it as the first character automatically make it the delimiter? I guess I never mastered when to use brackets (besides when needing a variable container, or a specific character class, like a-z, 0-9, etc.), and which type to use. You're right about escaping, it looks ugly, but I noticed it sometimes depends on the language, so I tend to play it safe. Thanks!

treznik 2009-11-28 00:24:09

Yes, the first character in the pattern becomes the delimiter, and you have to match it at the end. You can also use all kinds of brackets; in that case you match them "naturally" instead of repeating the first character: `(..pattern..)`, `{..pattern..}` and so on.
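A rough Python model of that delimiter rule may help. This is only a sketch of the idea, not PHP's actual parser: `split_delimited` and `BRACKETS` are hypothetical names, and it assumes that whatever follows the closing delimiter is read as modifiers. Under that assumption it also suggests why a /-delimited pattern containing http:// produces the "Unknown modifier '/'" error mentioned earlier in the thread.

```python
# Bracket-style delimiters close with their natural partner.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def split_delimited(pattern):
    """Split a PHP-style delimited pattern into (body, modifiers)."""
    opener = pattern[0]
    if opener.isalnum() or opener.isspace() or opener == "\\":
        raise ValueError("delimiter must not be alphanumeric, whitespace, or a backslash")
    closer = BRACKETS.get(opener, opener)
    depth = 0
    i = 1
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == closer and depth == 0:
            return pattern[1:i], pattern[i + 1:]
        if closer != opener:
            # Track nesting for bracket-style delimiters only.
            if ch == opener:
                depth += 1
            elif ch == closer:
                depth -= 1
        i += 1
    raise ValueError("no ending delimiter found")

# With / as delimiter, the first unescaped / inside the URL ends the pattern early.
print(split_delimited(r'/(http://\S+)/'))
# With # as delimiter, the slashes need no escaping.
print(split_delimited(r'#(http://\S+[a-z0-9/])#'))
# Bracket delimiters are closed by the matching bracket; "i" is left over as a modifier.
print(split_delimited(r'{(http://\S+)}i'))
```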

kemp 2009-11-28 00:34:54
