Solution: this works:

String p="<pre>[\\w\\W]*</pre>";

I want to match and capture the content enclosed by the <pre></pre> tags. I tried the following, but it is not working. What's wrong?

String p="<pre>.*</pre>";

Matcher m=Pattern.compile(p,Pattern.MULTILINE|Pattern.CASE_INSENSITIVE).matcher(input);
if(m.find()){
    String g=m.group(0);
    System.out.println("g is "+g);
}
+3  A: 

You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of the ^ and $ anchors so that they match at line boundaries, while DOTALL is the flag that lets . match line separators. You probably want to use a reluctant quantifier, too:

String p = "<pre>.*?</pre>";
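
To make the flag difference concrete, here is a minimal sketch. The class name PreMatchDemo, the helper firstPreContent, and the sample input are made up for illustration. It also adds a capturing group so that group(1) holds only the enclosed content, which is an assumption about what "capture" means in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreMatchDemo {
    // Hypothetical helper: returns the text between the first <pre> and the
    // next </pre>, or null when the pattern does not match under these flags.
    static String firstPreContent(String input, int flags) {
        Matcher m = Pattern.compile("<pre>(.*?)</pre>", flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input with a line break inside the element.
        String input = "<PRE>line one\nline two</PRE>";

        // MULTILINE only affects ^ and $, so . still stops at the newline: no match.
        System.out.println(firstPreContent(input, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE));

        // DOTALL lets . cross the newline, so the content is captured.
        System.out.println(firstPreContent(input, Pattern.DOTALL | Pattern.CASE_INSENSITIVE));
    }
}
```

With this input the first call should return null and the second should return both lines of content.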
Alan Moore
What's the reluctant ? for?
If there's more than one `<pre>` element, a greedy `.*` will match from the first opening `<pre>` to the last closing `</pre>`. The reluctant (or non-greedy) `.*?` will stop at the first closing tag.
Alan Moore
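
For anyone who wants to see the greedy versus reluctant difference directly, a small sketch along these lines should show it. The class name GreedyVsReluctant, the helper firstMatch, and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: returns the first match of regex in input
    // (compiled with DOTALL), or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Two <pre> elements with other markup in between.
        String input = "<pre>one</pre><p>between</p><pre>two</pre>";

        // Greedy: runs from the first <pre> to the last </pre>.
        System.out.println(firstMatch("<pre>.*</pre>", input));

        // Reluctant: stops at the first </pre>.
        System.out.println(firstMatch("<pre>.*?</pre>", input));
    }
}
```

The greedy pattern returns the whole input string here, while the reluctant one returns only the first element.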
+2  A: 

Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.

Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
    System.out.println(element.text());
}

The parse() method can also take a URL or a File, by the way.


The reason I recommend Jsoup, by the way, is that it is the least verbose of all the HTML parsers I have tried. It not only provides JavaScript-like methods that return elements implementing Iterable, but it also supports jQuery-like selectors, which was a big plus for me.

BalusC