ansaurus

Question

Java regular expression to match patterns and extract them

Answer 1

+6 A:

#[^#]+#

Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.

The reason yours does not work is the greediness of the star (from regular-expressions.info):

[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
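To make the suggested pattern concrete, here is a minimal sketch of pulling every `#...#` span out of a string with `Matcher.find()`. The class name, method name, and sample input are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashDelimitedExtractor {
    // Negated character class: one or more non-'#' characters between two '#'.
    private static final Pattern DELIMITED = Pattern.compile("#[^#]+#");

    // Returns every match, including the surrounding '#' characters.
    static List<String> extractAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = DELIMITED.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical input, chosen only to show two separate matches.
        String input = "see #http://a.example/x# and #http://b.example/y# here";
        System.out.println(extractAll(input));
        // prints [#http://a.example/x#, #http://b.example/y#]
    }
}
```

Because `[^#]` can never consume a `#`, each match stops at the first closing delimiter instead of running on to the last one.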

Tomalak 2009-09-09 09:52:32

Works great! Thanks a lot. But why do you think # is not recommended for URL delimiting?

Keshav 2009-09-09 09:55:07

Because the # is a valid character in URLs - it is the fragment identifier.

Tomalak 2009-09-09 09:57:27

Oh yes! I forgot about URLs like http://test.com/index.html#section1

Keshav 2009-09-09 09:58:34

Maybe using something like `'['` and `']'`, and a regex of `\[\S+\]` would work better.

Tomalak 2009-09-09 10:03:23

Answer 2

+5 A:

Assuming Java regex supports it, use the non-greedy quantifier .*? instead of the greedy .* so that the match ends as soon as possible instead of as late as possible.

If the language doesn't support it, then you can approximate it by simply checking for anything that's not an ending delimiter, like so:

#[^#]*#
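As a rough sketch of the non-greedy variant in Java, the version below also wraps the quantifier in a capturing group so that only the text between the delimiters is returned. The class name, method name, and sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantExtractor {
    // ".*?" is reluctant: it grows only as far as needed to reach the next '#'.
    // Group 1 captures the text between the two delimiters.
    private static final Pattern RELUCTANT = Pattern.compile("#(.*?)#");

    // Returns the captured inner text of every match, without the '#' characters.
    static List<String> extractInner(String input) {
        List<String> inner = new ArrayList<>();
        Matcher m = RELUCTANT.matcher(input);
        while (m.find()) {
            inner.add(m.group(1));
        }
        return inner;
    }

    public static void main(String[] args) {
        // Hypothetical input, not taken from the question.
        System.out.println(extractInner("a #one# b #two# c"));
        // prints [one, two]
    }
}
```

Note that both `.*?` and `[^#]*` accept an empty span, so `##` yields an empty string here; use `+` in place of `*` if empty matches should be skipped.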

Amber 2009-09-09 09:52:59

Yep, Java supports reluctant matches. Java regexes are based on Perl 5, and just about everything you can do in Perl is possible in Java; it's just likely to be 10 times more verbose (and twice as readable).

corlettk 2009-09-09 10:08:46

I recommend against using non-greedy quantifiers when a character exclusion would do the job. Character exclusions are faster because they won't backtrack.

Tomalak 2009-09-09 10:08:50

Neither will non-greedy quantifiers if they can find a match without backtracking.

Amber 2009-09-09 20:06:07

Answer 3

+2 A:

Quantifiers in regular expressions are "greedy" by default, that is, they match as much text as possible. In your example, the pattern "#.*#" translates to:

match a "#"
match as many characters as possible such that you can still ...
... match a "#"

What you want is a "non-greedy" or "reluctant" quantifier such as "*?". Try "#.*?#" in your case.
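The difference is easy to see side by side. This is a small sketch, assuming a made-up input string and a hypothetical helper `firstMatch`, that runs both patterns against the same text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns the first match of the given regex in the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input containing two delimited items.
        String input = "x #one# y #two# z";

        // Greedy: ".*" runs to the end, then backs off to the LAST '#'.
        System.out.println(firstMatch("#.*#", input));  // prints #one# y #two#

        // Reluctant: ".*?" stops at the FIRST '#' that completes the match.
        System.out.println(firstMatch("#.*?#", input)); // prints #one#
    }
}
```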

janko 2009-09-09 09:55:30
