Edit: The 100% correct answer is that you don't want to do this at all. However, I have accepted the answer that helped the most.

I'm being given ugly XML by a client who promises to fix it. In the meantime, I need to clean it up myself. I'm looking for a regex to use in Java that adds quotes around unquoted attribute values. A general solution would be better, but so far only one attribute is broken, so the regex can refer specifically to "attr1". The value of the attribute is unknown, so I can't include it in the search.

<tag attr1 = VARIABLETEXT>
<tag attr1 = "VARIABLETEXT">not quoted</tag>
<tag attr1 = VARIABLETEXT attr2 = "true">
<otherTag>buncha junk</otherTag>
<tag attr1 = "VARIABLETEXT">"quoted"</tag>

should turn into:

<tag attr1 = "VARIABLETEXT">
<tag attr1 = "VARIABLETEXT">not quoted</tag>
<tag attr1 = "VARIABLETEXT" attr2 = "true">
<otherTag>buncha junk</otherTag>
<tag attr1 = "VARIABLETEXT">"quoted"</tag>

EDIT: Thank you very much for telling me not to do what I'm trying to do. However, this isn't some random, anything-goes XML where I'll run into all the "don't do it" issues. I have read the other threads. I'm looking for specific help with a specific hack.

+4  A: 

Do not use regex to fix/parse/process markup languages. Read here why.

Use a forgiving parser like Tidy to read and fix the document in a few easy steps. There is a Java library (JTidy) you can use.

Tomalak
Thank you for that thread reference. It made life worth living.
prodigitalson
Yeah, I've read that. Can anyone just help me with the regex without preaching?
Instantsoup
No, I'm sorry. There is no way to get it 100% right with a regex; there is always some weird corner case. Why is using a parser not an option?
Tomalak
I'll settle for 89% right then. Thanks for the parser idea. It's not not an option. I just don't have the time to do it right right now, which is why I came here for regex help.
Instantsoup
+2  A: 

OK, given your constraints, you could:

Search for

<tag attr1\s*=\s*([^" >]+)

and replace with

<tag attr1 = "\1"

So, in Java, that could be (according to RegexBuddy):

String resultString = subjectString.replaceAll("<tag attr1\\s*=\\s*([^\" >]+)", "<tag attr1 = \"$1\"");
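
For anyone who wants to try this against the sample lines from the question, here is a minimal, self-contained sketch. The class name AttributeQuoter and the method name quoteAttr1 are made up for illustration; only the pattern and the replacement string come from the line above. It assumes the unquoted value never contains a space, a double quote, or ">".

```java
import java.util.regex.Pattern;

class AttributeQuoter {
    // Same pattern as above: "<tag attr1", optional whitespace around "=",
    // then a run of characters that are not a double quote, a space, or '>'.
    // An already-quoted value starts with '"', so it cannot match.
    private static final Pattern UNQUOTED_ATTR1 =
            Pattern.compile("<tag attr1\\s*=\\s*([^\" >]+)");

    // Hypothetical helper: wraps the captured value in double quotes.
    static String quoteAttr1(String xml) {
        return UNQUOTED_ATTR1.matcher(xml).replaceAll("<tag attr1 = \"$1\"");
    }

    public static void main(String[] args) {
        // Sample input taken from the question.
        String input = String.join("\n",
                "<tag attr1 = VARIABLETEXT>",
                "<tag attr1 = VARIABLETEXT attr2 = \"true\">",
                "<otherTag>buncha junk</otherTag>",
                "<tag attr1 = \"VARIABLETEXT\">\"quoted\"</tag>");
        System.out.println(quoteAttr1(input));
    }
}
```

Known limits of this hack: a value containing a space gets cut off at the first space, and a single-quoted value such as 'x' would end up wrapped in double quotes as well.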

EDIT: Simplified regex a bit more.

Tim Pietzcker
Sorry, there is definitely a space between the variable text and attr2.
Instantsoup
Oh, in that case, it's a lot easier. Will edit.
Tim Pietzcker