I need to parse all URLs from a paragraph (string), e.g.:

"check out this site google.com and don't forget to see this too bing.com/maps"

It should return "google.com" and "bing.com/maps".

I'm currently using this, and it's not perfect:

reMatch("(^|\s)[^\s@]+\.[^\s@\?\/]{2,5}((\?|\/)\S*)?",mystring)

thanks

+3  A: 

You need to define more clearly what you consider a URL.

For example, I might use something such as this:

(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?

(use with reMatchNoCase or plonk (?i) at front to ignore case)

Which specifically only allows alphanumerics, underscore, and hyphen in domain and path parts, requires the TLD to be letters only, and only looks for numeric ports.

It might be that this is good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes, brackets, etc. off the end of the URL, or whatever - it depends on the context of what you're doing as to whether you'd like to err towards missing URLs or detecting non-URLs. (I'd probably go for the latter, then potentially run a secondary filter to verify if something is a URL, but that takes more work, and may not be necessary for what you're doing.)
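As a rough illustration of the trimming idea, here is a minimal Java sketch (Java because that is the regex engine used later in this answer). The class name `TrailingTrim`, the helper `trimTrailing`, and the particular set of trailing characters are assumptions chosen for illustration, not part of the original answer. Because the expression allows `.` and `,` in the path, a sentence-ending period such as the one in "bing.com/maps." would otherwise be captured.

```java
public class TrailingTrim {
    // Hypothetical helper: strip punctuation that commonly trails a URL in prose.
    // The character set is an assumption; adjust it to suit your input.
    static String trimTrailing(String candidate) {
        return candidate.replaceAll("[.,;:!?\"'\\])>]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trimTrailing("bing.com/maps."));  // prints bing.com/maps
        System.out.println(trimTrailing("google.com),"));    // prints google.com
        System.out.println(trimTrailing("google.com"));      // unchanged
    }
}
```

Trimming after matching keeps the main expression permissive, which fits the "detect more, then filter" approach described above.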


Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. :) (Note that all groups are non-capturing (?:...) since we don't need the individual parts.)

# PROTOCOL
 (?:https?:)?    # optional group of "http:" or "https:"

# SERVER NAME / DOMAIN
 (?://)?         # optional double forward slash
 (?:[\w-]+\.)+   # one or more "word characters" or hyphens, followed by a literal .
                 # grouped together and repeated one or more times
 [a-z]{2,6}      # as many as 6 alphas, but at least 2

# PORT NUMBER
 (?::\d+)?       # an optional group made up of : and one or more digits

# PATH INFO
 (?:/[\w.,-]+)*  # a forward slash then multiple alphanumeric, underscores, or hyphens
                 # or dots or commas (add any other characters as required)
                 # in a group that might occur multiple times (or not at all)

# QUERY STRING
 (?:\?\S+)?      # an optional group containing ? then any non-whitespace
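To see the commented parts working together, here is a small Java sketch that assembles the same expression piece by piece and runs it against the sample sentence from the question. The class name `UrlFinder` and the method `findUrls` are made up for this illustration; only `java.util.regex` itself is real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFinder {
    // The expression from the answer, built from its commented parts.
    static final Pattern URL = Pattern.compile(
          "(?:https?:)?"       // optional protocol
        + "(?://)?"            // optional double forward slash
        + "(?:[\\w-]+\\.)+"    // one or more "name." groups
        + "[a-z]{2,6}"         // TLD: 2 to 6 letters
        + "(?::\\d+)?"         // optional numeric port
        + "(?:/[\\w.,-]+)*"    // zero or more path segments
        + "(?:\\?\\S+)?",      // optional query string
        Pattern.CASE_INSENSITIVE);

    // Collect every match in the text, in order of appearance.
    static List<String> findUrls(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "check out this site google.com and don't forget to see this too bing.com/maps";
        System.out.println(findUrls(sample));  // prints [google.com, bing.com/maps]
    }
}
```

Note that, as written, this expression also picks up the domain part of an email address (for example `gmail.com` out of an address at gmail.com), since nothing constrains what comes before the match.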



Update: To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don't have an @ sign (or anything else unwanted) but without actually including that prior character in the match.

CF's regex is Apache ORO which doesn't support lookbehinds, but we can use the java.util.regex nice and easily with a component I have created which does support lookbehinds.

Using that is as simple as:

<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
...
<cfset Urls = jrex.match( regex , input ) />

After the createObject, it should basically be like using the built-in re~ stuff, but with the slight syntax difference, and the different regex engine under the hood.

(If you have any problems or questions with the component, let me know.)


So, on to your excluding emails from URL matching problem:

We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say "we must have this" or "we must not have this", like so:

(?<=\s) # there must be whitespace before the current position
(?<!@)  # there must NOT be an @ before current position

For this URL example, I would expand either of those examples to:

(?<=\s|^)   # look for whitespace OR start of string

or

(?<![@\w/]) # ensure there is not a @ or / or word character.

Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.

Put whichever one you like at the start of your expression, and it should no longer match the end of an email address, unless I've screwed something up. :)
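A short Java sketch may help show how the two lookbehinds differ in practice. The class name `LookbehindDemo`, the helper `find`, and the addresses at example.org are placeholders invented for this illustration; the expression itself is the one from this answer with each lookbehind put at the front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // The URL expression from the answer, without any lookbehind.
    static final String BASE =
        "(?:https?:)?(?://)?(?:[\\w-]+\\.)+[a-z]{2,6}(?::\\d+)?(?:/[\\w.,-]+)*(?:\\?\\S+)?";

    // Positive form: must be preceded by whitespace or the start of the string.
    static final Pattern POSITIVE =
        Pattern.compile("(?<=\\s|^)" + BASE, Pattern.CASE_INSENSITIVE);

    // Negative form: must NOT be preceded by @, /, or a word character.
    static final Pattern NEGATIVE =
        Pattern.compile("(?<![@\\w/])" + BASE, Pattern.CASE_INSENSITIVE);

    static List<String> find(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String withEmail = "see google.com or mail me@example.org";
        System.out.println(find(POSITIVE, withEmail));  // prints [google.com]
        System.out.println(find(NEGATIVE, withEmail));  // prints [google.com]

        // The two forms disagree when the URL follows punctuation.
        System.out.println(find(POSITIVE, "(google.com)"));  // prints []
        System.out.println(find(NEGATIVE, "(google.com)"));  // prints [google.com]
    }
}
```

One thing to watch with the negative form as written: for an address with a subdomain, such as the placeholder me@mail.example.org, the part after the first dot (`example.org`) is preceded by `.`, which is not in the excluded set, so it still matches. Adding `.` to the character class is one way to close that gap.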


Update 2:

Here is some sample code which will exclude any email addresses from the match:

<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />

<cfsavecontent variable="SampleInput">
check out this site google.com and don't forget to see this too bing.com/maps
this is an [email protected] which should not be matched
</cfsavecontent>

<cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />

<cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />

<cfdump var="#MatchedUrls#" />

Make sure you have downloaded the jre-utils.cfc from here and put it in an appropriate place (e.g. the same directory as the script running this code).

This step is required because the (?<=...) construct does not work in CF regular expressions.

Peter Boughton
Thanks for the answer! It seems to work fine, just one thing: it will add gmail.com from an email address [email protected], so if mystring has an email address in it, the email domain gets added as a URL. Is there a way to avoid any @ signs?
loo
Hmmm, I'd do that with a lookbehind, but CF uses Apache ORO regex, which doesn't support those. I do however have a CFC which provides easy access to Java's regex from CF - I'll update my answer with details of all this.
Peter Boughton
Thanks again, but you might be right about "screwed something up"! I'm getting an error when adding the lookbehind to omit an email address. If you could please revise it, I would really appreciate that. This seemed to work, if you can fill in the email exclusion as is... (?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w-]+)*(?:\?\S+)?
loo
Hmmm, can you provide sample input so I can check it out? (Feel free to email to [email protected] if that's easier/preferred.)
Peter Boughton
Let me clarify what I'm trying to achieve: I need to parse a full paragraph of text to extract all URLs, e.g. "check out this site google.com and don't forget to see this too bing.com/maps". It should either require http:// or just www, or even accept just google.com, which can be much trickier. I also need to avoid an email address being considered a URL. I don't have a sample at the moment. When you sent me this, (?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w-]+)*(?:\?\S+)? it worked; I just needed to add something to stop an email address from being considered a URL.
loo
Ok, see my latest update above. If this doesn't work for you, post details of what the error is.
Peter Boughton
Thanks for your quick response and the CFC! I added it and it seems to work really well... I'll run some more tests and let you know.
loo
Hi and thanks again! I found a bug: when parsing a URL like <http://reviews.cnet.com/8301-19512_7-10114978-233.html> it comes out as <http://reviews.cnet.com/8301-19512_7-10114978-233>, with the .html removed. I tried it with a shorter URL etc., but when I tried <http://cnet.com/8301-19512_7-10114978-233.html> without the subdomain "reviews" it came out fine. I guess it stops short after two periods ".".
loo
Ah, sorry - yeah, should be allowing `.` in the path info part, maybe a few other characters too ... Ok, have updated it. (If I've missed any other required characters, you can probably add them into the `[\w.,-]` part, before the hyphen.)
Peter Boughton
thanks it worked!!!
loo
Hi, me again. I need to add a # as well to [\w.,-], but it gives an error... I need it because a URL with a # for an <a name> anchor gets trimmed.
loo
Since `#` is a special character in CF, you need to double it to escape it, so instead of just `[\w#.,-]` use `[\w##.,-]` (this applies to all CF strings, not just regular expressions).
Peter Boughton
loo