I am trying to write a regex that grabs the entire URL of any .gov or .edu web address so I can turn it into a link.

I currently have:

/(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/

It is all wrapped in () so I can reuse the captured match. It works for ANY URL, but I only want the .gov or .edu ones.

Thanks in advance.

A: 

[-A-Z0-9+&@#\/%?=~_|!:,.;]* appears to be slurping up most of the URL, so we need to jam the .gov and .edu in there somewhere. The quickest solution would be:

[-A-Z0-9+&@#\/%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*

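However, because / is in the character class before the top-level domain, the .gov can come from the path instead of the host, so this will also match a URL like: http://www.example.com/evil.gov/test.html

To fix this, we can take the / out of the character class that matches before the top-level domain: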

[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*
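A quick way to see the difference between the two fragments is to run both against the problem URL. This is only a sketch in Python (the question does not say which language is in use). The names `SCHEME`, `TAIL`, `quick` and `fixed` are made up for illustration, the scheme prefix is borrowed from the original regex, and `re.IGNORECASE` is an assumption, since the character classes only list A-Z.

```python
import re

# Scheme prefix taken from the question's regex; tail is shared by both variants.
SCHEME = r"\b(https?|ftp)://"
TAIL = r"(\.gov|\.edu)[-A-Z0-9+&@#/%?=~_|!:,.;]*"

# Quick solution: the class before the TLD still contains "/".
quick = re.compile(SCHEME + r"[-A-Z0-9+&@#/%?=~_|!:,.;]+" + TAIL, re.IGNORECASE)
# Fixed version: "/" removed from the class before the TLD.
fixed = re.compile(SCHEME + r"[-A-Z0-9+&@#%?=~_|!:,.;]+" + TAIL, re.IGNORECASE)

evil = "http://www.example.com/evil.gov/test.html"
good = "http://www.example.gov/test.html"

print(bool(quick.search(evil)))  # the quick version is fooled by ".gov" in the path
print(bool(fixed.search(evil)))  # the fixed version rejects it
print(bool(fixed.search(good)))  # a real .gov host still matches
```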

Putting it all together, we have:

/(\b(https?|ftp):\/\/[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]?)/

I added a ? to the last token because otherwise the regex would not match a bare http://example.gov, where nothing follows the top-level domain.

Damn, that is ugly.
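For the "make it into a link" part, here is a minimal sketch of how the full regex might be used, assuming Python and case-insensitive matching (neither is stated in the question). `URL_RE` and `linkify` are hypothetical names. Only the URL is HTML-escaped here; the surrounding text is passed through untouched.

```python
import html
import re

# The full regex from above; Python raw strings do not need "/" escaped.
URL_RE = re.compile(
    r"(\b(https?|ftp)://[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)"
    r"[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]?)",
    re.IGNORECASE,
)

def linkify(text: str) -> str:
    """Wrap every matched .gov/.edu URL in an anchor tag."""
    def repl(match: re.Match) -> str:
        # Group 1 is the whole URL, as in the original regex.
        url = html.escape(match.group(1), quote=True)
        return f'<a href="{url}">{url}</a>'
    return URL_RE.sub(repl, text)

print(linkify("See http://example.gov and http://www.example.com/evil.gov/x"))
```

Only the first URL should come back wrapped in a link; the second is left as plain text because its .gov sits in the path.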

orangeoctopus
Note - many of those symbols are illegal in domain names. Removing them would make it significantly less ugly.
zigdon
Agreed, zigdon. I wanted to work with his original regex.
orangeoctopus
It matches `http://FOO.edu-BAR.X` though.
Pumbaa80
Thanks, guys. And yes, I will be removing some of that; I had it in there for testing.
ernie
Haha, I guess we could make it uglier, Pumbaa80.
orangeoctopus
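Pulling the comments together, one possible tightening is to restrict the host to letters, digits, hyphens and dots, and to add a lookahead so that .gov or .edu has to be the last label of the host. This is a sketch under those assumptions, not the regex from the answer: `STRICT_RE` is a made-up name, Python and `re.IGNORECASE` are assumed, and user:password@ prefixes are deliberately not handled.

```python
import re

STRICT_RE = re.compile(
    r"\b(?:https?|ftp)://"
    r"(?:[A-Z0-9-]+\.)+(?:gov|edu)"   # dot-separated labels ending in gov or edu
    r"(?![A-Z0-9-]|\.[A-Z0-9])"       # the host must really end here
    r"(?::\d+)?"                      # optional port
    # Optional path/query/fragment that does not end in punctuation.
    r"(?:[/?#](?:[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|])?)?",
    re.IGNORECASE,
)

for candidate in (
    "http://FOO.edu-BAR.X",                       # Pumbaa80's counterexample
    "http://example.gov.evil.com",                # .gov is not the last label
    "http://www.example.com/evil.gov/test.html",  # .gov only in the path
    "see http://www.example.edu/a/b.html, ok",    # should match without the comma
):
    match = STRICT_RE.search(candidate)
    print(candidate, "->", match.group(0) if match else None)
```

The lookahead is what rejects `http://FOO.edu-BAR.X`: after `edu` the next character is a hyphen, so the host has not ended yet.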