In Java, is there a simple way to extract a substring by specifying the regular expression delimiters on either side, without including the delimiters in the final substring?

For example, if I have a string like this:

<row><column>Header text</column></row>

what is the easiest way to extract the substring:

Header text

Please note that the substring may contain line breaks...

thanks!

+5  A: 

Write a regex like this:

"(regex1)(.*)(regex2)"

... and pull out the middle group from the matcher. To handle newlines in the input, compile the pattern with Pattern.DOTALL so that . also matches line terminators.

Using your example we can write a program like:

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    public static void main(String[] args) {
        // DOTALL makes '.' match line terminators as well
        Pattern p = Pattern.compile(
                "<row><column>(.*)</column></row>",
                Pattern.DOTALL
        );

        Matcher matcher = p.matcher(
                "<row><column>Header\n\n\ntext</column></row>"
        );

        // matches() succeeds only if the whole input matches the pattern
        if (matcher.matches()) {
            // group(1) is the text between the delimiters
            System.out.println(matcher.group(1));
        }
    }

}

When run, this prints:

Header


text
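
The same idea can be generalized to arbitrary delimiter regexes. The following is only a minimal sketch: the class DelimitedExtract and the method between are hypothetical names, not part of any library. It assumes you want the first occurrence anywhere in the input, so it uses find() instead of matches(), and a reluctant .*? so the match stops at the nearest closing delimiter.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedExtract {

    // Hypothetical helper: returns the text between the first match of
    // startRegex and the nearest following match of endRegex.
    static Optional<String> between(String input, String startRegex, String endRegex) {
        // The delimiters are wrapped in non-capturing groups, and the body is a
        // named group, so capturing groups inside the delimiters do not shift it.
        // (Assumes the delimiters do not themselves define a group named "body".)
        Pattern p = Pattern.compile(
                "(?:" + startRegex + ")(?<body>.*?)(?:" + endRegex + ")",
                Pattern.DOTALL); // '.' also matches line breaks
        Matcher m = p.matcher(input);
        return m.find() ? Optional.of(m.group("body")) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "<row><column name='title'>Header\n\n\ntext</column></row>";
        // [^>]* tolerates attributes on the opening tag
        System.out.println(
                between(input, "<column[^>]*>", "</column>").orElse("(no match)"));
    }
}
```

Returning an Optional makes the "no match" case explicit instead of relying on null. This still treats the input as plain text, so it is only suitable when the format is tightly controlled.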
Aaron Maenpaa
@Adam ... it's only because I needed to fire up Eclipse to get an example and wanted to get an answer up quickly ;)
Aaron Maenpaa
@Aaron: fair enough. I may as well delete my first comment then :) Nice answer.
Adam Bernier
@Aaron - thank you, your example works! But could you please tell me what regular expression pattern to use to extract the same text from a string like this, which includes some single quotes?

<row><column name='title'>Header\n\n\ntext</column></row>

I've tried using

Pattern p = Pattern.compile( "<row><column name='title'>(.*)</column></row>", Pattern.DOTALL );

and the same but with backslashes in front of the quotes, but neither works. Sorry, I am very new to regular expressions; I appreciate the help. Thank you again!
Anna
Anna, that's why it is easier to just use the proper tool to parse XML: an XML parser. XML is not a regular language, so don't try to parse it with regular expressions.
Svante
+1  A: 

You should not use regular expressions to decode XML - this will eventually break if the input is not strictly controlled.

The easiest thing is probably to parse the XML into a DOM tree (Java 1.4 and newer include an XML parser) and then navigate the tree to pick out what you need.
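
As a rough sketch of that approach, assuming the XML arrives as a String and the text of the first column element is what you want (the class DomExtract and the method firstColumnText are hypothetical names chosen for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomExtract {

    // Hypothetical helper: parses the XML string into a DOM tree and returns
    // the text of the first <column> element, or null if there is none.
    static String firstColumnText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Node column = doc.getElementsByTagName("column").item(0);
            // getTextContent() requires Java 5 or newer (DOM Level 3)
            return column == null ? null : column.getTextContent();
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<row><column name='title'>Header\n\n\ntext</column></row>";
        System.out.println(firstColumnText(xml));
    }
}
```

Attributes and quoting on the column element need no special handling here, because the parser deals with them.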

Perhaps you could tell us what you want to accomplish with your program?

Thorbjørn Ravn Andersen
+1 once you've got a DOM tree you can use XPath to pull out the bits you want.
Aaron Maenpaa
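
A small sketch of the XPath suggestion, using the javax.xml.xpath API that ships with Java 5 and newer. The class XPathExtract and the method evaluate are hypothetical names; for brevity this uses the overload of XPath.evaluate that parses an InputSource itself, rather than building the DOM tree in a separate step.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathExtract {

    // Hypothetical helper: evaluates an XPath expression against an XML string
    // and returns the string value of the first matching node
    // (an empty string if nothing matches).
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<row><column name='title'>Header\n\n\ntext</column></row>";
        System.out.println(evaluate(xml, "/row/column[@name='title']"));
    }
}
```

If the same document is queried several times, parsing it once into a DOM tree and passing the Document to evaluate avoids re-parsing on every call.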