[EDITED - really sorry, the code I quoted was wrong - have changed the message below to reflect this. Apologies! Thank you for your patience.]

I'm new to regular expressions and want to match a pattern in Java (following on from this solution - http://stackoverflow.com/questions/962122/java-string-get-everything-between-but-not-including-two-regular-expressions).

The string is [EDITED]:

<row><column name='_id'>1</column></row><row><column name='text'>Header\n\n\ntext</column></row><row><column name='pwd'>password</column></row>

And I want to return only what's between the column name='text' tags, so:

Header\n\n\ntext

I've got the code below [EDITED], but it doesn't match. Any ideas on how I need to change the Pattern?

Thanks!

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    public static void main(String[] args) {
        Pattern p = Pattern.compile(
                "<row><column name='text'>(.*)</column></row>",
                Pattern.DOTALL
        );

        Matcher matcher = p.matcher(
                "<row><column name='_id'>1</column></row><row><column name='text'>Header\n\n\ntext</column></row><row><column name='pwd'>password</column></row>"
        );

        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

}
+1  A: 

Try matching (.*?) instead of just (.*).

(.*) is greedy: it consumes as much as it can, so the group only ends at the last occurrence of "</column></row>" in the input.

(.*?) will stop at the first occurrence of "</column></row>".

Edit: This shouldn't really affect your example, but if there is another "</column></row>" later in the string, the greedy pattern will capture more than you expect.
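
To make the difference concrete, here is a minimal sketch that runs both quantifiers over a two-row string. The class name GreedyVsReluctant and the helper firstGroup are made up for illustration, and the sketch uses find() rather than matches() so that the pattern is allowed to match part of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {

    // Hypothetical helper: returns group 1 of the first match found
    // anywhere in the input, or null if the pattern is not found.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<row><column name='text'>Header</column></row>"
                     + "<row><column name='pwd'>password</column></row>";

        // Greedy: the group runs up to the LAST "</column></row>".
        // prints: Header</column></row><row><column name='pwd'>password
        System.out.println(firstGroup("<row><column name='text'>(.*)</column></row>", input));

        // Reluctant: the group stops at the FIRST "</column></row>".
        // prints: Header
        System.out.println(firstGroup("<row><column name='text'>(.*?)</column></row>", input));
    }
}
```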

Kai
+3  A: 

The (unedited) code you posted works fine for me... it matches and prints out the message you expect.

The edited code doesn't work, because matches() requires the pattern to match the entire string. However, if you change the regex very slightly so that it allows for the surrounding text:

Pattern p = Pattern.compile(
            ".*<row><column name='text'>(.*)</column></row>.*",
            Pattern.DOTALL
        );

you get a match:

Header


text</column></row><row><column name='pwd'>password

That's probably not what you actually want though, so you'll need to refine the regex further. Using regular expressions to handle XML/HTML parsing isn't generally a good approach. Yishai's suggestion of using an XML parser is a better way to do it; otherwise you'll most likely end up with a tremendously complicated and inflexible regular expression.
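
As a rough sketch of the parser route, the JDK's built-in DOM and XPath APIs can pull out the column by its name attribute. The class name XPathColumn and the method columnText are invented for this example. It also assumes the rows can be wrapped in a made-up <rows> root element, since the string in the question has no single root and would not parse as XML on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathColumn {

    // Hypothetical helper: returns the text of the first <column> whose
    // name attribute equals columnName, or "" if there is no such column.
    static String columnText(String rows, String columnName) {
        // Assumption: wrap the fragment in a single root element so it is well-formed.
        String xml = "<rows>" + rows + "</rows>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumption: columnName contains no quote characters.
            String expr = "/rows/row/column[@name='" + columnName + "']";
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse rows", e);
        }
    }

    public static void main(String[] args) {
        String rows = "<row><column name='_id'>1</column></row>"
                    + "<row><column name='text'>Header\n\n\ntext</column></row>"
                    + "<row><column name='pwd'>password</column></row>";
        System.out.println(columnText(rows, "text"));
    }
}
```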

bm212
Thank you! And thanks for the advice - I'll look at using an XML parser.
+1  A: 

Perhaps what you really want to get to is this:

public static void main(String[] args) {
    Pattern p = Pattern.compile(
            "<row><column name='(.*?)'>(.*?)</column></row>",
            Pattern.DOTALL
        );

    Matcher matcher = p.matcher(
            "<row><column name='text'>Header\n\n\ntext</column></row>"
        );

    if(matcher.matches()){
            System.out.println(matcher.group(2));
    }
}

I say this because your real data could have anything in the name= value (at least, that would seem much more real-world).
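
If the input holds several rows, the same two-group pattern could be applied repeatedly to collect every name and value. The following is only a sketch under that assumption: the class ColumnMap and the method columns are hypothetical names, and it calls find() in a loop instead of matches() so that each row is matched in turn.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnMap {

    private static final Pattern COLUMN = Pattern.compile(
            "<row><column name='(.*?)'>(.*?)</column></row>",
            Pattern.DOTALL);

    // Hypothetical helper: maps each column name to its text, in document order.
    static Map<String, String> columns(String rows) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher m = COLUMN.matcher(rows);
        while (m.find()) {
            result.put(m.group(1), m.group(2)); // group 1 = name, group 2 = text
        }
        return result;
    }

    public static void main(String[] args) {
        String rows = "<row><column name='_id'>1</column></row>"
                    + "<row><column name='text'>Header\n\n\ntext</column></row>"
                    + "<row><column name='pwd'>password</column></row>";
        System.out.println(columns(rows).get("text"));
    }
}
```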

That being said, if this gets much more non-trivial, you might want to look at doing this with a SAX parser (one is built in to the JDK 1.5+, so it isn't necessarily a library dependency issue). Regex is the better choice if you really don't care much about document structure and just want to pull something trivial out of the XML. However, once you start caring about attributes and what their values are, continuing down the regex route means reinventing the wheel.
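
For comparison, a SAX version might look roughly like the sketch below. SaxColumn and columnText are illustrative names, not part of any library. As an assumption, the rows are wrapped in a made-up <rows> root element because the original string has no single root. If several columns share the requested name, this sketch simply concatenates their text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxColumn {

    // Hypothetical helper: collects the text inside <column name='columnName'>.
    static String columnText(String rows, final String columnName) {
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inside; // true while inside the wanted column

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                inside = "column".equals(qName)
                        && columnName.equals(attrs.getValue("name"));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append.
                if (inside) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                inside = false;
            }
        };

        try {
            // Assumption: wrap the fragment in a single root element.
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader("<rows>" + rows + "</rows>")),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse rows", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String rows = "<row><column name='_id'>1</column></row>"
                    + "<row><column name='text'>Header\n\n\ntext</column></row>";
        System.out.println(columnText(rows, "text"));
    }
}
```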

Yishai
A: 

Your problem has nothing to do with the quote characters. You just need to switch to a non-greedy quantifier (as others have suggested) and use the find() method instead of matches():

public static void main(String[] args)
{
  Pattern p = Pattern.compile(
      "<row><column name='text'>(.*?)</column></row>",
      Pattern.DOTALL
  );

  Matcher matcher = p.matcher(
      "<row><column name='_id'>1</column></row>" +
      "<row><column name='text'>Header\n\n\ntext</column></row>" +
      "<row><column name='pwd'>password</column></row>"
  );

  if(matcher.find()) {
      System.out.println(matcher.group(1));
  }
}

matches() returns true only if the regex matches from the very beginning of the target string to the very end. If you want to match anything less than the whole string, you need to use find().
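
A small sketch of that difference, using the same pattern on a one-row and a two-row string. The class name MatchesVsFind and its two helper methods are made up for the example.

```java
import java.util.regex.Pattern;

class MatchesVsFind {

    static final Pattern TEXT_COLUMN = Pattern.compile(
            "<row><column name='text'>(.*?)</column></row>",
            Pattern.DOTALL);

    // True only if the pattern spans the input from start to end.
    static boolean matchesWhole(String input) {
        return TEXT_COLUMN.matcher(input).matches();
    }

    // True if the pattern occurs anywhere in the input.
    static boolean findsAnywhere(String input) {
        return TEXT_COLUMN.matcher(input).find();
    }

    public static void main(String[] args) {
        String single = "<row><column name='text'>Header</column></row>";
        String multi = "<row><column name='_id'>1</column></row>" + single;

        System.out.println(matchesWhole(single));  // prints: true
        System.out.println(matchesWhole(multi));   // prints: false
        System.out.println(findsAnywhere(multi));  // prints: true
    }
}
```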

Alan Moore