The regular expression pattern for ="any characters that aren't a double quote" is ="[^"]*", which as a Java string literal is "=\"[^\"]*\"".
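As a minimal sketch of the pattern in use, the following program collects every ="..." segment from a string. The class name QuotedValueDemo, the helper findAll, and the sample input are made up for illustration and are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValueDemo {
    // The regex ="[^"]*" written as a Java string literal.
    static final Pattern QUOTED = Pattern.compile("=\"[^\"]*\"");

    // Collects every substring of the input that the pattern matches.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: name="foo" value="bar baz" empty=""
        String input = "name=\"foo\" value=\"bar baz\" empty=\"\"";
        System.out.println(findAll(input));
    }
}
```

For this input the program prints [="foo", ="bar baz", =""]. Because * allows zero repetitions, the empty value ="" is matched too.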
The [...] construct is called a character class; e.g. [aeiou] matches one of any of the lowercase vowels. The [^...] construct is a negated character class; e.g. [^aeiou] matches one of anything but the lowercase vowels (which includes consonants, symbols, digits, etc.).
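One quick way to see the two constructs side by side is to delete everything each one matches and look at what is left. This is only a sketch; the class CharClassDemo, the helper strip, and the sample text are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns the input with every match of the regex removed,
    // so what remains is whatever the regex did NOT match.
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "Regex 101!"; // hypothetical sample text
        // [aeiou] matches the lowercase vowels, so they are removed.
        System.out.println(strip("[aeiou]", input));
        // [^aeiou] matches everything else, so only the vowels survive.
        System.out.println(strip("[^aeiou]", input));
    }
}
```

The first call leaves Rgx 101! and the second leaves ee: the uppercase R, the space, the digits and the ! are all "anything but the lowercase vowels".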
Note that this pattern does not allow for an escaped " inside the string; a pattern that accounts for that possibility has to be more elaborate.
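To make the limitation concrete, here is a small sketch that runs the same pattern over a value containing backslash-escaped quotes. The class EscapedQuoteDemo, the helper firstMatch, and the input are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteDemo {
    // Returns the first match of ="[^"]*" in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("=\"[^\"]*\"").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input; the actual characters are:  x="say \"hi\" now"
        String input = "x=\"say \\\"hi\\\" now\"";
        // The match ends at the first double quote, even though it is escaped.
        System.out.println(firstMatch(input));
    }
}
```

The output is ="say \" rather than the whole quoted value, because [^"] knows nothing about the backslash in front of the quote.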
On greedy, reluctant, and negated character class matching
To understand why ".+" doesn't "work" as expected, and why you sometimes see the reluctant version ".+?" used to try to "fix" the problem, consider the following example:
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ
A.*?Z yields 2 matches: AiiZ and AoooZ
Let's first focus on what A.*Z does. Once it has matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched
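The walk-through above can be mimicked by hand. The sketch below is not how java.util.regex is implemented; it is a simplified stand-in (the class GreedyBacktrackDemo and the method matchGreedy are hypothetical) that only tries the first A and treats . as matching any character.

```java
class GreedyBacktrackDemo {
    // Mimics A.*Z anchored at the first 'A': let .* grab everything up to the
    // end of the input, then give back one character at a time until the next
    // character is 'Z'. Returns the matched text, or null if there is none.
    // Simplification: a real engine would also retry from later positions.
    static String matchGreedy(String input) {
        int start = input.indexOf('A');
        if (start < 0) {
            return null;
        }
        // dotStarEnd is the index just past what "A.*" currently holds.
        for (int dotStarEnd = input.length(); dotStarEnd > start; dotStarEnd--) {
            if (dotStarEnd < input.length() && input.charAt(dotStarEnd) == 'Z') {
                // Z can match now, so the overall pattern matches.
                return input.substring(start, dotStarEnd + 1);
            }
            // Z can't match here: backtrack, so .* matches one fewer character.
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchGreedy("eeeAiiZuuuuAoooZeeee"));
    }
}
```

Running it prints AiiZuuuuAoooZ: the loop starts with .* holding everything after the A and shrinks it until the last Z comes into reach, just as in the diagrams.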
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then takes more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy
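The comparison is easy to reproduce with java.util.regex. This is a minimal sketch; the class GreedyVsReluctantDemo and the helper findAll are names invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Returns every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "eeeAiiZuuuuAoooZeeee";
        System.out.println("greedy    A.*Z  : " + findAll("A.*Z", input));
        System.out.println("reluctant A.*?Z : " + findAll("A.*?Z", input));
    }
}
```

The greedy pattern reports the single match AiiZuuuuAoooZ, while the reluctant one reports AiiZ and AoooZ.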
Example: An alternative
In many applications, the two matches in the above input are what is desired, so a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative: a negated character class.
The pattern A[^Z]*Z finds the same two matches as the A.*?Z pattern for the above input. [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is performance: being stricter, the negated character class can match only one way for a given input, so it doesn't matter whether you use the greedy or the reluctant quantifier with it. In fact, in some flavors you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
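Java is one of the flavors that supports possessive quantifiers (written with a trailing +, as in *+). The sketch below checks that the greedy, reluctant and possessive forms of the negated character class all agree on the Example 1 input; it does not measure performance. The class NegatedClassDemo and the helper findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Returns every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "eeeAiiZuuuuAoooZeeee";
        // Greedy, reluctant and possessive quantifiers on the negated class.
        for (String regex : new String[] {"A[^Z]*Z", "A[^Z]*?Z", "A[^Z]*+Z"}) {
            System.out.println(regex + " -> " + findAll(regex, input));
        }
    }
}
```

All three lines report AiiZ and AoooZ. The possessive form is safe here only because [^Z] can never consume the Z that must follow; the same trick on .* would make A.*+Z fail, since .*+ would swallow the final Z and refuse to give it back.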
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
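These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ
A.*?ZZ yields 1 match: AiiZooAuuZZ
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ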
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
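The diagram can be checked with a short program. It assumes the three patterns are A[^Z]*ZZ, A.*?ZZ and A.*ZZ, i.e. the Example 1 patterns with ZZ as the terminator; the class FromAToZZDemo and the helper findAll are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FromAToZZDemo {
    // Returns every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "eeAiiZooAuuZZeeeZZfff";
        // Assumed patterns: negated character class, reluctant, greedy.
        System.out.println("n: " + findAll("A[^Z]*ZZ", input));
        System.out.println("r: " + findAll("A.*?ZZ", input));
        System.out.println("g: " + findAll("A.*ZZ", input));
    }
}
```

Under that assumption the negated character class finds AuuZZ, the reluctant pattern finds AiiZooAuuZZ, and the greedy pattern finds AiiZooAuuZZeeeZZ. Note that the negated class and the reluctant quantifier no longer agree: the first A cannot reach ZZ without crossing a single Z, which [^Z] forbids but .*? allows.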