I have the following regex that I am using in a Java application. Sometimes it works correctly and sometimes it doesn't.

<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->

Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.

The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure whether the fix is obvious from looking at the pattern alone.

A: 

The * quantifier is "greedy" by default, meaning it matches as much as possible while still letting the overall pattern match.

You can make it non-greedy (reluctant) by using *? instead, so try:

(\".*?\")
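To make the difference concrete, here is a minimal sketch that runs both variants against the same string. The class name GreedyDemo, the helper firstGroup and the sample input are made up for illustration; only the two patterns come from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: two quoted values on the same line.
        String input = "name=\"title\" class=\"big\"";

        // Greedy: .* runs on to the last quote in the input.
        System.out.println(firstGroup("name=(\".*\")", input));   // "title" class="big"

        // Reluctant: .*? stops at the first closing quote.
        System.out.println(firstGroup("name=(\".*?\")", input));  // "title"
    }
}
```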
Jeremy Smyth
+1  A: 

I would replace that .* with [\w-]*, for example, if the name is an identifier of some sort.

Or use [^\"]*, which cannot match a double quote and therefore stops at the closing one.
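A small sketch of why a character class is safer than a reluctant dot-star here: when the rest of the pattern cannot match right after the first closing quote, .*? simply keeps expanding, while a class that excludes the quote cannot. The class name, helper and sample inputs are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with a second attribute after name.
        String input = "<editable name=\"a\" other=\"b\">";

        // The reluctant dot-star still expands past the first closing quote,
        // because that is the only way to reach the '>' the pattern demands.
        System.out.println(firstGroup("name=(\".*?\")>", input));    // "a" other="b"

        // The negated class cannot cross a quote, so there is no match at all.
        System.out.println(firstGroup("name=(\"[^\"]*\")>", input)); // null

        // The identifier-style class behaves the same way on a simple name.
        System.out.println(firstGroup("name=(\"[\\w-]*\")>",
                "<editable name=\"main-title\">"));                  // "main-title"
    }
}
```

Failing to match is arguably the better outcome for the second case: the input does not have the expected shape, and the pattern says so instead of quietly capturing too much.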

Edit:

As mentioned in another post, you might consider a simple DOM traversal or an XPath- or XQuery-based evaluation instead of a plain regular expression. But note that you will still need a regex in the filtering step, because you can find the target comments only by testing their body against a regular expression (I doubt the body is constant, judging from the sample).

Edit 2:

It might be that leading, trailing or internal whitespace in the comment body makes your regexp fail. Consider putting \s* at the beginning and at the end, plus \s+ before the attribute-like part.

<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->

Or, if you are filtering comment bodies in an XML-based search:

"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"

Edit 3: Fixed the escapes twice. Thanks Alan M.
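For the XML-based filtering mentioned above, a rough sketch with the JDK's built-in DOM parser could look like the following. It assumes the input is well-formed XML and that both marker comments are direct children of the same element. The class name EditableRegions, its methods and the sample document are hypothetical. The opening pattern is adapted slightly: the name is required and the capture group sits inside the quotes, so group(1) is the bare name.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EditableRegions {
    // Comment-body patterns; adapted so that group(1) is the name without quotes.
    static final Pattern OPEN = Pattern.compile("\\s*<editable\\s+name=\"([^\"]*)\">\\s*");
    static final Pattern CLOSE = Pattern.compile("\\s*</editable>\\s*");

    // Walks the direct children of parent and collects the text found between
    // an opening and a closing marker comment, keyed by region name.
    static Map<String, String> regions(Node parent) {
        Map<String, String> out = new LinkedHashMap<>();
        String name = null;
        StringBuilder body = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.COMMENT_NODE) {
                Matcher open = OPEN.matcher(n.getNodeValue());
                if (open.matches()) {
                    name = open.group(1);
                    body.setLength(0);
                } else if (name != null && CLOSE.matcher(n.getNodeValue()).matches()) {
                    out.put(name, body.toString());
                    name = null;
                }
            } else if (name != null) {
                body.append(n.getTextContent());
            }
        }
        return out;
    }

    static Map<String, String> parse(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return regions(root);
        } catch (Exception e) {
            // Error policy chosen for this sketch: rethrow unchecked.
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><!-- <editable name=\"title\"> -->Hello<!-- </editable> --></div>";
        System.out.println(parse(xml)); // {title=Hello}
    }
}
```

The regexes here only ever see the body of a single comment, so they never have to cope with the surrounding markup.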

kd304
\w matches letters, digits and the underscore, so [\w\d\-\_] would only need to be [\w-] (the hyphen doesn't need to be escaped if it's the first or last character listed).
Alan Moore
+5  A: 

XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.

Choose an XML parser.

Svante
Even if you have a case where the input data is guaranteed to be nesting-free, *ML is still complex enough that hand-rolled regexes will generally be incorrect outside of extremely narrow applications. So use a real XML parser even if your current data is simple enough for regexes to deal with.
Dave Sherohman
+1 -- @Svante: I'm sure it is possible to make it to the daily "200 rep"-cap with "don't do XML/HTML with regex" posts alone. ;-)
Tomalak
@kd304, using a proper parser is generally quicker and easier than fiddling around with the wrong tool. Regex is not a magic black box, it is a tool for parsing regular languages.
Svante
@Tomalak, I share this feeling.
Svante
objwz: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
Jason Day
@Jason Day: And then there are those people who know a popular quote about regular expressions. SCNR ;-)
Tomalak
Sorry, I wanted to remove the downvote as I now found a hole in your argument but can't unless you 'edit' your answer.
kd304
I'm sorry, I changed my mind about the -1, but the site does not allow me to revoke it any more. I made a fav+upvote on one of your questions.
kd304
+1  A: 

As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.

Pattern p = Pattern.compile(
    "<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
    "(.*?)" +
    "<!--\\s*</editable>\\s*-->",
    Pattern.DOTALL);

I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.

I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
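As a usage sketch, the pattern can be applied with Matcher.find() in a loop to pull out every region. The class name EditableMatcher, the extract method and the sample page are assumptions for illustration; the pattern itself is the one shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableMatcher {
    // Same pattern as above: group 1 is the name, group 2 the region content.
    static final Pattern P = Pattern.compile(
        "<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
        "(.*?)" +
        "<!--\\s*</editable>\\s*-->",
        Pattern.DOTALL);

    // Collects every region in the input, keyed by name, in document order.
    static Map<String, String> extract(String input) {
        Map<String, String> regions = new LinkedHashMap<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            regions.put(m.group(1), m.group(2));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Hypothetical page: one single-line region and one spanning several lines.
        String page =
            "<!-- <editable name=\"title\"> -->Hello<!-- </editable> -->\n" +
            "<p>fixed text</p>\n" +
            "<!--<editable name=\"body\">-->\nline 1\nline 2\n<!--</editable>-->";
        Map<String, String> regions = extract(page);
        System.out.println(regions.keySet());     // [title, body]
        System.out.println(regions.get("title")); // Hello
    }
}
```

Because the middle group is reluctant, each match ends at the nearest closing marker, so two regions on one page do not get merged into a single match.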

Alan Moore
Oh, didn't notice my answer was ill escaped. Thanks.
kd304