+1  Q: 

Java Regex problem

Hi there,

I'm making an XMLParser for a Java program (I know there are good XMLParsers out there but I just want to do it).

I have a method called getAttributeValue(String xmlElement, String attribute) and am using regex to find a sequence of characters that have the attribute name plus

="any characters that aren't a double quote"

I can then parse the contents of the quotes. Unfortunately, I'm having trouble with the regex pattern. If I use:

Pattern p = Pattern.compile(attribute + "=\"(.)+\"");

Then I get a string starting with my attribute name, but because there are loads of attributes and values and the last one's value has the double quotes, I get the string I want plus all the other attribute names and values like so:

attributeOne="contents" attributeTwo="contents2" attributeThree="contents3"

So I thought that I could have a regex pattern that, instead of the "." any characters symbol, would have "any characters but not a double quote". I have tried:

Pattern p = Pattern.compile(attribute + "=\"(.&&[^\"])+\"");
Pattern p = Pattern.compile(attribute + "=\"(.&&(^\"))+\"");
Pattern p = Pattern.compile(attribute + "=\"([.&&[^\"]]+)\"");

but none of them work. I'd be grateful for any suggestions and comments.

Thanks.

+1  A: 

try this:

attribute + "=\".*?\""

The reasons: use * instead of + because an attribute value can be empty, as in something=""
Use *? instead of * to make the quantifier reluctant instead of greedy.
regular expressions tutorial on repetition
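
Here's a minimal sketch of how the reluctant pattern might slot into the getAttributeValue method from the question. The class name, the capturing group, and the Pattern.quote call are my own assumptions for illustration, not part of the original pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantAttributeDemo {
    // Returns the attribute's value, or null when the attribute is absent.
    static String getAttributeValue(String xmlElement, String attribute) {
        // Pattern.quote (an addition here) keeps metacharacters in the
        // attribute name from being read as regex syntax.
        // (.*?) is reluctant, so it stops at the first closing quote.
        Pattern p = Pattern.compile(Pattern.quote(attribute) + "=\"(.*?)\"");
        Matcher m = p.matcher(xmlElement);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String element =
            "<item attributeOne=\"contents\" attributeTwo=\"contents2\" empty=\"\">";
        System.out.println(getAttributeValue(element, "attributeOne")); // contents
        System.out.println(getAttributeValue(element, "empty").isEmpty()); // true
    }
}
```

Because the quantifier is * rather than +, the empty="" case still matches and yields an empty string.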

Professor_Calculus
Great. Thank you very much for your reply.
Joe
+1  A: 
attribute + "=\"[^\"]*\""

should work. But what do you do if the string you're matching against might contain escaped quotes itself? Do you anticipate a need to handle this?

In that case, you could use

attribute + "=\"(?:\\\\.|[^\"])*\""
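
A small sketch comparing the two patterns on a value that contains backslash-escaped quotes. The class name, the firstMatch helper, and the sample input are made up for illustration. Note that XML itself escapes quotes with the &quot; entity, so the second pattern only matters for backslash-style escaping:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedQuoteDemo {
    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The raw text is: title="say \"hi\"" other="x"
        String input = "title=\"say \\\"hi\\\"\" other=\"x\"";

        // The plain negated class stops at the first quote, even an escaped one.
        System.out.println(firstMatch("title=\"[^\"]*\"", input));
        // prints: title="say \"

        // The alternation consumes backslash + any character as one unit first.
        System.out.println(firstMatch("title=\"(?:\\\\.|[^\"])*\"", input));
        // prints: title="say \"hi\""
    }
}
```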
Tim Pietzcker
XML escaped quotes are `&quot;`
Professor_Calculus
Hi Tim, thanks very much for your reply. I've used the example you gave. Could you please explain what the ?: characters do? I can read that it's any character or any character that's not a double quote, zero to many times. But that's not all, because of the preceding ?:. I understand that ? after the + or * makes the operator reluctant so that the first match is taken without reading the whole input, but I have not seen it used preceding something.
Joe
?: is a lookaround
Professor_Calculus
@Professor_Calculus: No, it's not a lookaround. `(?:...)` is the same as `(...)` with one difference: The group's contents are not captured for later re-use, making the regex a little more efficient since we're not using the content of the group. We just need it because of the alternation (`a(b|c)d` vs. `ab|cd`) and because we're repeating it with the `*`.
Tim Pietzcker
+6  A: 

The regular expression pattern for:

="any characters that aren't a double quote"

is ="[^"]*", which as a Java string literal is "=\"[^\"]*\"".

The [...] construct is called a character class; e.g. [aeiou] matches one of any of the lowercase vowels. The [^...] construct is a negated character class; e.g. [^aeiou] matches one of anything but the lowercase vowels (which includes consonants, symbols, digits, etc).

Note that this pattern does not allow escaped " in the String (see link below for patterns that account for this possibility).
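
To make the difference concrete, here is a small sketch that runs the greedy pattern and the negated character class against the attribute string from the question. The class name and the firstMatch helper are assumptions for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String element = "attributeOne=\"contents\" attributeTwo=\"contents2\""
                       + " attributeThree=\"contents3\"";

        // Greedy .+ runs to the last double quote in the input.
        System.out.println(firstMatch("attributeOne=\".+\"", element));
        // prints the whole input string

        // [^"]* cannot cross a double quote, so it stops at the first one.
        System.out.println(firstMatch("attributeOne=\"[^\"]*\"", element));
        // prints: attributeOne="contents"
    }
}
```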


On greedy, reluctant, and negated character class matching

To understand why ".+" doesn't work as expected, and why you sometimes see the reluctant version ".+?" used to fix the problem, consider the following example:

Example 1: From A to Z

Let's compare these two patterns: A.*Z and A.*?Z.

Given the following input:

eeeAiiZuuuuAoooZeeee

The patterns yield the following matches:

A.*Z yields 1 match: AiiZuuuuAoooZ
A.*?Z yields 2 matches: AiiZ and AoooZ

Let's first focus on what A.*Z does. Once it has matched the first A, the .*, being greedy, first tries to match as many . as possible.

eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match

Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:

eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match

This happens a few more times, until finally we come to this:

eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match

Now Z can match, so the overall pattern matches:

eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched

By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then takes more . as necessary. This explains why it finds two matches in the input.

Here's a visual representation of what the two patterns matched:

eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy
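
The comparison can be reproduced with a short Java sketch. The class name and the allMatches helper are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "eeeAiiZuuuuAoooZeeee";
        System.out.println(allMatches("A.*Z", input));  // [AiiZuuuuAoooZ]
        System.out.println(allMatches("A.*?Z", input)); // [AiiZ, AoooZ]
    }
}
```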

Example: An alternative

In many applications, the two matches in the above input are what is desired, so a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative: a negated character class.

The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.

The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use a greedy or a reluctant quantifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
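
Java is one of the flavors with possessive quantifiers (written with a trailing +). The following is a rough sketch, with a made-up class name and helper, showing that the possessive form finds the same matches when paired with the negated class, and why it is not safe to pair with the dot:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveDemo {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "eeeAiiZuuuuAoooZeeee";
        System.out.println(allMatches("A[^Z]*Z", input));  // [AiiZ, AoooZ]
        // *+ never gives characters back; [^Z] stops before Z anyway.
        System.out.println(allMatches("A[^Z]*+Z", input)); // [AiiZ, AoooZ]
        // With the dot, .*+ swallows the rest of the line, so Z never matches.
        System.out.println(allMatches("A.*+Z", input));    // []
    }
}
```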


Example 2: From A to ZZ

This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.

eeAiiZooAuuZZeeeZZfff

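These are the matches for the above input:

A[^Z]*ZZ yields 1 match: AuuZZ
A.*?ZZ yields 1 match: AiiZooAuuZZ
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ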

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
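
The same three patterns can be checked with a short Java sketch. The class name and the allMatches helper are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FromAToZZ {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "eeAiiZooAuuZZeeeZZfff";
        // Negated class: the first A fails (a lone Z blocks it), the second succeeds.
        System.out.println(allMatches("A[^Z]*ZZ", input)); // [AuuZZ]
        // Reluctant: starts at the first A and stops at the first ZZ.
        System.out.println(allMatches("A.*?ZZ", input));   // [AiiZooAuuZZ]
        // Greedy: starts at the first A and runs to the last ZZ.
        System.out.println(allMatches("A.*ZZ", input));    // [AiiZooAuuZZeeeZZ]
    }
}
```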


polygenelubricants
Fantastic. Thank you so much for your time and help. Very much appreciated.
Joe