I am trying to match multiple CSS style blocks in an HTML document. This code matches the first block but not the second. What code would I need to match the second? Can I just get a list of the groups that are inside my 'style' tags? Should I call the 'find' method to get the next match?

Here is my regex pattern:

^.*(<style type="text/css">)(.*)(</style>).*$

Usage:

final Pattern pattern_css = Pattern.compile(css_pattern_buf.toString(),
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

final Matcher match_css = pattern_css.matcher(text);
if (match_css.matches() && (match_css.groupCount() >= 3)) {
    System.out.println("Woot ==>" + match_css.groupCount());
    System.out.println(match_css.group(2));
} else {
    System.out.println("No Match");
}
+8  A: 

I am trying to match multiple CSS style blocks in an HTML document.

Standard Answer: don't use regex to parse HTML. Regex cannot parse HTML reliably, no matter how complicated and clever you make your expression. Unless you are absolutely sure the exact format of the target document is totally fixed, string or regex processing is insufficient and you must use an HTML parser.

(<style type="text/css">)(.*)(</style>)

That's a greedy expression. The (.*) in the middle will match as much as it possibly can. If you have two style blocks:

<style type="text/css">1</style> <style type="text/css">2</style>

then it will happily match '1</style> <style type="text/css">2'.

Use (.*?) to get a non-greedy expression, which will allow the trailing (</style>) to match at the first opportunity.
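To make the difference concrete, here is a minimal sketch that runs both variants against the two-block sample above. The class name GreedyVsLazy and the helper firstBody are made up for illustration, and I've dropped the outer groups so the body is group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class GreedyVsLazy {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstBody(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";
        // Greedy: (.*) runs on to the LAST </style> in the input.
        System.out.println(firstBody("<style type=\"text/css\">(.*)</style>", html));
        // Reluctant: (.*?) stops at the FIRST </style>.
        System.out.println(firstBody("<style type=\"text/css\">(.*?)</style>", html));
    }
}
```

The first call prints 1</style> <style type="text/css">2 and the second prints just 1.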

Should I call the 'find' method to get the next match?

Yes, and you should have used it to get the first match too. The usual idiom is:

while (matcher.find()) {
    // n is the number of the capturing group you want
    s = matcher.group(n);
}
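Filled out into something runnable, the idiom might look like the sketch below. It assumes the non-greedy three-group pattern, so the body is group 2; the class name StyleBlocks and the method extract are placeholders of mine, and the Standard Answer caveat still applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class StyleBlocks {
    // DOTALL lets '.' cross line breaks; MULTILINE is not needed without ^ or $.
    private static final Pattern STYLE = Pattern.compile(
            "(<style type=\"text/css\">)(.*?)(</style>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects the body of every style block, in document order.
    static List<String> extract(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = STYLE.matcher(html);
        while (m.find()) {          // each call resumes after the previous match
            bodies.add(m.group(2)); // group 2 is the text between the tags
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";
        System.out.println(extract(html)); // one list entry per style block
    }
}
```

Unlike matches(), find() does not require the pattern to cover the whole input, which is why the leading ^.* and trailing .*$ can go.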

Note that standard string processing (indexOf, etc.) may be a simpler approach for you than regex, since you're only matching completely fixed strings. However, the Standard Answer still applies.
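For comparison, a rough sketch of the plain-string route using only String.indexOf and substring. The names StyleBlocksIndexOf, OPEN, CLOSE and extract are invented for the example. Note one assumption that differs from the regex version: this match is case-sensitive, so <STYLE ...> would be missed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class StyleBlocksIndexOf {
    static final String OPEN = "<style type=\"text/css\">";
    static final String CLOSE = "</style>";

    // Collects the text between each OPEN and the next CLOSE after it.
    static List<String> extract(String html) {
        List<String> bodies = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(OPEN, from);
            if (start < 0) break;               // no more opening tags
            int bodyStart = start + OPEN.length();
            int end = html.indexOf(CLOSE, bodyStart);
            if (end < 0) break;                 // unterminated block: stop here
            bodies.add(html.substring(bodyStart, end));
            from = end + CLOSE.length();        // continue after this block
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">1</style> <style type=\"text/css\">2</style>";
        System.out.println(extract(html));
    }
}
```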

bobince
Thanks, I wasn't aware of matcher.find() either. But then I don't need regexes in Java often :)
sirprize
A: 

You can simplify the regex as follows:

(<style type="text/css">)(.*?)(</style>)

And if you don't need groups 1 and 3 (you probably don't), I would drop their parentheses, leaving only:

<style type="text/css">(.*?)</style>
Gumbo