ansaurus

Question

Improving a regular expression to neatly match all "http"-only URLs

Answer 1

+6 A:

You can use lookahead instead of making ['\"\< >] part of your match, i.e.:

(http:\/\/.*?)(?=['\"\< >])

Generally speaking, whereas ab matches ab, a(?=b) matches a (if it's followed by b).
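As a minimal sketch of the lookahead in practice, here it is in Python's re module (the question does not say which engine is in use, so Python is an assumption, and the sample text and URLs are made-up placeholders):

```python
import re

# Hypothetical sample input; none of these URLs come from the question.
text = """<a href="http://example.com/a">x</a> see http://example.org/b 'http://example.net/c'"""

# Reluctantly consume characters until the lookahead sees a terminator.
# The terminator itself is asserted, not consumed, so it stays out of the match.
pattern = re.compile(r"""http://.*?(?=['"< >])""")

urls = pattern.findall(text)
print(urls)

# The general rule in miniature: a(?=b) matches only "a", and only if "b" follows.
demo = re.search(r"a(?=b)", "ab").group(0)
no_match = re.search(r"a(?=b)", "ac")
print(demo, no_match)
```

Note that a URL sitting at the very end of the input, with no terminator character after it, would not be matched by this pattern, because the lookahead needs an actual character to assert.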

References

regular-expressions.info/Lookarounds

Capturing group option

Lookarounds are not supported by all flavors. More widely supported are capturing groups.

Generally speaking, whereas (a)b still matches ab, it also captures a in group 1.
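A rough sketch of the capturing-group variant, again in Python (an assumed engine). The pattern below is an assumption: it simply takes the terminator class out of the lookahead and lets it be consumed by the match, with the URL read from group 1. The sample text is hypothetical.

```python
import re

# Hypothetical sample input.
text = "<a href='http://example.com/page'>link</a>"

# The terminator is consumed by the overall match, but only group 1 holds the URL.
pattern = re.compile(r"""(http://.*?)['"< >]""")

m = pattern.search(text)
whole = m.group(0)  # whole match, including the trailing terminator
url = m.group(1)    # just the URL
print(repr(whole), repr(url))

# The general rule in miniature: (a)b matches "ab" and captures "a" in group 1.
g = re.search(r"(a)b", "ab")
print(g.group(0), g.group(1))
```

The trade-off is that the caller has to read group 1 rather than the whole match.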

References

regular-expressions.info/Round Brackets for Grouping

Negated character class option

Depending on the need, a negated character class is often much better than a reluctant .*? (followed, in this case, by a lookahead to assert the terminator pattern).

Let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.

We use the following as input:

eeAiiZooAuuZZeeeZZfff

We use 3 different patterns:

A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
- This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
- This is the reluctant variant; group 1 matched and captured iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
- This is the negated character class variant; group 1 matched and captured uu

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
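
The three results above can be reproduced with a short Python script (Python is an assumption here; the original demos were linked on ideone.com, and the dictionary keys are just illustrative labels):

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

# The three patterns from the list above, keyed by an illustrative label.
patterns = {
    "greedy": r"A(.*)ZZ",
    "reluctant": r"A(.*?)ZZ",
    "negated": r"A([^Z]*)ZZ",
}

results = {}
for name, pat in patterns.items():
    # Record (whole match, group 1) for every match found in the input.
    results[name] = [(m.group(0), m.group(1)) for m in re.finditer(pat, text)]
    print(name, results[name])
```

Each pattern yields exactly one match on this input, but the negated character class starts at the second A, because [^Z]* cannot cross the lone Z after "ii".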

References

regular-expressions.info/Character Class and Repetition: An Alternative to Laziness

Related questions

Difference between .*? and .* for regex

polygenelubricants 2010-07-13 12:21:46

Why was this downvoted? Because believe me, if I have to go an extra mile or two, I will.

polygenelubricants 2010-07-13 12:43:05

Huh? Makes no sense for this to be downvoted and not the other (at the time) very similar lookahead question, which came after. Well, assuming the downvote was cast before the boat picture was added. :)

Peter Boughton 2010-07-13 13:44:34

Answer 2

+1 A:

You need to use "(?=regex)" (lookahead), which checks for a particular pattern but doesn't include it in the result:

http:\/\/.*?(?=['\"\< >])

R. Hill 2010-07-13 12:24:13

Answer 3

+1 A:

Hmmm, I'd probably do this simply by saying "keep going until you get an unwanted character", like so:

http://[^'"< >]*

Escaped version (based on Q - not sure what engine this is):

http:\/\/[^'\"\< >]*

However, the lookahead solution by polygenelubricants is more flexible if you might have some of those characters in the URL (but not at the end).
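
For illustration, a minimal Python sketch of the "keep going until you get an unwanted character" approach (Python is an assumption, since the engine is not known, and the sample text is hypothetical):

```python
import re

# Hypothetical sample input; the second URL runs to the end of the string.
text = """<img src="http://example.com/x.png"> http://example.org/y"""

# Match "http://" and then every character that is not a quote, "<", space or ">".
pattern = re.compile(r"""http://[^'"< >]*""")

urls = pattern.findall(text)
print(urls)
```

Unlike the lookahead version, this also picks up a URL at the very end of the input, since the character class does not require a terminator to follow.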

Peter Boughton 2010-07-13 12:26:30

+1; whenever applicable, negated charclass is definitely the way to go.

polygenelubricants 2010-07-13 12:27:27

Hey Peter, congratulations back on reaching 10k! :)

Tim Pietzcker 2010-07-13 12:37:55

Thanks Tim. :) And thanks to poly for the vote which took me over the milestone. :)

Peter Boughton 2010-07-13 13:31:16
