ansaurus

Question

How to write a correct regex for URLs on a page that are not inside anchors?

Answer 1

+1 A:

You just need to look a little ahead of and behind the URL to see whether it's in quotes. It's unlikely that someone would paste a quoted URL as plain text, but URLs are normally quoted in tags and doctypes. So your regex becomes:

(^|[^'"])(http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([^'"]+|$)

(^|[^'"]) means the start of the string or a character that is NOT a quote. ([^'"]+|$) means one or more characters that are not quotes, or the end of the string.

The extra brackets around the old regex make it a capture group, so you can retrieve the actual URL with \2 (group 2) instead of getting the extra crap it might have matched at the edges of the URL.
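As a rough illustration of the group-2 idea, here is a minimal Python sketch. The `URL` pattern is a deliberately simplified stand-in, not the regex from the question, and the names `GUARDED` and `unquoted_urls` are made up for this example.

```python
import re

# Simplified stand-in for the URL pattern (an assumption, not the asker's regex).
URL = r"http://[\w.-]+(?:/[^\s'\"<>]*)?"

# Group 1: start of string, or one non-quote character before the URL.
# Group 2: the URL itself.
# Group 3: one non-quote character after the URL, or end of string.
GUARDED = re.compile(r"(^|[^'\"])(" + URL + r")([^'\"]|$)")

def unquoted_urls(text):
    """Return the URLs in text that are not preceded by a quote."""
    return [m.group(2) for m in GUARDED.finditer(text)]

sample = 'See http://example.com/page for details, <a href="http://example.org/x">link</a>'
print(unquoted_urls(sample))  # ['http://example.com/page']
```

Only group 2 is returned, so the guard characters matched on either side never end up in the result.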

BTW, your URL regex looks pretty bad; there are more compact and accurate forms. You really don't need to escape EVERYTHING.
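To show what "not escaping everything" can look like, here is a hedged Python sketch. `COMPACT_URL` is a hypothetical pattern written for this example; it is not an authoritative URL matcher, and it only roughly covers the same punctuation as the original. Inside a character class, most punctuation is literal: only the backslash needs escaping here, and the hyphen is placed last so it is not read as a range.

```python
import re

# Hypothetical compact pattern (an assumption for illustration only).
# Within [...] the characters ~!@#$%^&*()=+/?.:;', are all literal as written;
# the backslash is escaped and the hyphen sits at the end of the class.
COMPACT_URL = re.compile(r"http://[\w.-]+[\w~!@#$%^&*()=+\\/?.:;',-]*")

match = COMPACT_URL.search("go to http://example.com/a?b=1&c=2 now")
print(match.group(0))  # http://example.com/a?b=1&c=2
```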

SpliFF 2009-05-18 18:18:08

Could you provide any samples of good regexes?

omoto 2009-05-18 18:28:53
