Hello,

I tried writing a program in Java using regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#" I should be able to get #www.google.com# and #google.com# strings extracted. Here is what I tried-

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseLinks {
    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        Pattern p = Pattern.compile("#.*#");

        Matcher matcher = p.matcher(message);

        while (matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
        }
    }
}

This prints a single match: #www.google.com# and this is another #google.com#. What I want is to extract only the strings #www.google.com# and #google.com#. What regex would do this?

Thanks,
-Keshav

+6  A: 
#[^#]+#

Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.

The reason yours does not work is the greediness of the star (from regular-expressions.info):

[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
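
For illustration, here is a minimal sketch of the question's program with only the pattern swapped for #[^#]+#. The class name HashDelimited and the helper extract are made up for this example; everything else is the standard java.util.regex API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class HashDelimited {
    // '#', then one or more characters that are not '#', then a closing '#'.
    private static final Pattern LINK = Pattern.compile("#[^#]+#");

    // Hypothetical helper: collects every non-overlapping match, left to right.
    static List<String> extract(String message) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(message);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        // Prints #www.google.com# and #google.com#, one per line.
        extract(message).forEach(System.out::println);
    }
}
```

Because [^#] can never consume a hash sign, each match has to stop at the first closing delimiter it meets.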

Tomalak
Works great! Thanks a lot. But why do you think # is not recommended for URL delimiting?
Keshav
Because the # is a valid character in URLs - it is the fragment identifier.
Tomalak
Oh yes! I forgot about URLS like http://test.com/index.html#section1
Keshav
Maybe using something like `'['` and `']'`, and a regex of `\[\S+\]` would work better.
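
A rough sketch of that bracket idea, assuming the links never contain whitespace and that two bracketed links are always separated by whitespace. The class name BracketDelimited, the helper extract, and the sample message are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class BracketDelimited {
    // A literal '[', one or more non-whitespace characters, a literal ']'.
    private static final Pattern LINK = Pattern.compile("\\[\\S+\\]");

    // Hypothetical helper: collects every match, left to right.
    static List<String> extract(String message) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(message);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input made up for the example; the first link contains a '#'.
        String message = "See [http://test.com/index.html#section1] and [google.com]";
        // Prints [http://test.com/index.html#section1] and [google.com], one per line.
        extract(message).forEach(System.out::println);
    }
}
```

Note that \S+ is still greedy: adjacent links with no whitespace between them, such as [a][b], come back as one match.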
Tomalak
+5  A: 

Assuming Java regex supports it, use the non-greedy pattern .*? instead of the greedy .* so that the match ends as soon as possible instead of as late as possible.

If the language doesn't support it, then you can approximate it by simply checking for anything that's not an ending delimiter, like so:

#[^#]*#
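
One small difference from the #[^#]+# pattern in the other answer: with the star, an empty pair of delimiters also counts as a match. A minimal sketch of that difference, where the class name StarVersusPlus, the helper findAll, and the sample input are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class StarVersusPlus {
    // Hypothetical helper: returns every match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "empty ## here";
        // Star allows zero characters between the delimiters.
        System.out.println(findAll("#[^#]*#", input)); // prints [##]
        // Plus requires at least one character between them.
        System.out.println(findAll("#[^#]+#", input)); // prints []
    }
}
```

Which one is preferable depends on whether an empty ## should be treated as a link at all.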
Amber
Yep, Java supports reluctant matches. Java RE's are based on Perl 5, and just about everything you can do in Perl is possible in Java, it's just likely to be 10 times more verbose (and twice as readable).
corlettk
I'd recommend against using non-greedy quantifiers when a character exclusion would do the job. Character exclusions are faster because they won't backtrack.
Tomalak
Neither will non-greedy quantifiers if they can find a match without backtracking.
Amber
+2  A: 

Regular expression quantifiers are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to

  • match a "#"
  • match as many characters as possible such that you can still ...
  • ... match a "#"

What you want is a "non-greedy" or "reluctant" quantifier such as "*?". Try "#.*?#" in your case.
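
To see the two side by side, here is a small sketch that runs the greedy and the reluctant pattern over the message from the question. The class name GreedyVersusReluctant and the helper findAll are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class GreedyVersusReluctant {
    // Hypothetical helper: returns every match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";

        // Greedy: .* runs to the end of the line, then backs up to the last '#'.
        // Prints [#www.google.com# and this is another #google.com#]
        System.out.println(findAll("#.*#", message));

        // Reluctant: .*? grows one character at a time and stops at the first '#'.
        // Prints [#www.google.com#, #google.com#]
        System.out.println(findAll("#.*?#", message));
    }
}
```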

janko