I have to replace the content of this XML string in Java:

<My:tag>value_1 22&#xA;value_2 54&#xA;value_3 11</My:tag>

This string was taken from an XML document, and when I read it I get this result:

<My:tag>value_1 22
value_2 54
value_3 11</My:tag>

If I try to replace the content this way:

String regex =  "(<My:tag>)(.*)(</My:tag>)";
String new_string = old_string.replaceAll(regex,"<My:tag> new_stuff </My:tag>");

I get no result, I think because of the &#xA; entity.

But if I try to replace the string without the &#xA; entity, everything works fine.

Suggestions? Thanks

A: 

I'd suggest using an XML library like JDOM or DOM4J for manipulating XML instead of using regular expressions.
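As a rough illustration of the parser-based approach, here is a minimal sketch. It uses the DOM parser that ships with the JDK rather than JDOM or DOM4J, which is a substitution on my part. The class name `DomReplace` and the helper `replaceTagContent` are made up for this example, and it assumes the fragment is a single element and that the parser is left in its default, non-namespace-aware mode (the `My:` prefix is not declared in the fragment).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomReplace {

    // Hypothetical helper: parses a one-element fragment, swaps its text, serializes it again.
    static String replaceTagContent(String xml, String newText) {
        try {
            // The default factory is not namespace-aware, so "My:tag" is treated as a plain name.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Replaces all children of the root element with a single text node.
            doc.getDocumentElement().setTextContent(newText);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked parser exceptions.
            throw new IllegalStateException("Could not rewrite XML fragment", e);
        }
    }

    public static void main(String[] args) {
        String oldString = "<My:tag>value_1 22&#xA;value_2 54&#xA;value_3 11</My:tag>";
        System.out.println(replaceTagContent(oldString, "new_stuff"));
    }
}
```

The parser decodes &#xA; into a line feed by itself, so the entity never has to be handled by hand, and the serializer escapes the new text if it contains characters such as < or &.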

Fabian Steeg
JDOM and DOM4J seem overkill if you just need to do a little text manipulation. You're right if you need to do large-scale stuff, but for this... no.
roe
Exactly, I only need to manipulate strings like this; I don't think this calls for another library...
Giancarlo
Another advantage of using an XML library is that the result of your manipulations is guaranteed to be well-formed XML--which is the whole point of (excuse for?) XML, isn't it?
Alan Moore
+1  A: 

I'm not 100% sure how the Java regex engine works, but I can't imagine that an entity would cause your problems. You should first try simply removing your parentheses, since you're replacing the entire expression and not extracting anything.

What might be causing it, though, is that your entity is actually being translated to a newline; in that case your regex won't catch it unless you explicitly let the dot match line breaks. You could also try doing

[.\n]*

instead of your

.*

This might be a bit greedy though, and the backtracking too much for the matcher to handle. Unfortunately, I don't have any Java stuff installed on this machine, so I can't really try it and test it. One other possibility would be to actively look for the next opening angle bracket, like so:

[^<]*

EDIT:
As you suggested, I tried your link and the following worked perfectly:

Expression:

<My:tag>[^<]*</My:tag>

Replacement:

<My:tag> new_stuff </My:tag>

Test string:

<My:tag>value_1 22&#xA;value_2 54&#xA;value_3 11</My:tag>
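For what it's worth, here is a minimal Java sketch of the same expression, assuming the entities have already been turned into real line feeds by the time the string is processed. The class name `NegatedClassReplace` and the helper `replaceTagContent` are only illustrative names, not anything from the question.

```java
import java.util.regex.Matcher;

public class NegatedClassReplace {

    // Hypothetical helper: replaces everything between <My:tag> and </My:tag>.
    static String replaceTagContent(String xml, String replacement) {
        // [^<]* matches any character except '<', line breaks included,
        // so it works without DOTALL mode.
        return xml.replaceAll(
                "<My:tag>[^<]*</My:tag>",
                // quoteReplacement keeps '$' and '\' in the new text literal.
                "<My:tag>" + Matcher.quoteReplacement(replacement) + "</My:tag>");
    }

    public static void main(String[] args) {
        // The string as it looks after the entities became real newlines.
        String oldString = "<My:tag>value_1 22\nvalue_2 54\nvalue_3 11</My:tag>";
        System.out.println(replaceTagContent(oldString, " new_stuff "));
        // prints "<My:tag> new_stuff </My:tag>"
    }
}
```

Note that this only works while the tag content contains no nested elements, since the match stops at the first < it sees.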
roe
This does not seem to work, and I am not using any extra library. If you want, you can experiment here: http://www.fileformat.info/tool/regex.htm
Giancarlo
Java has the (?s) flag for DOTALL mode, which makes . match newlines. Indeed, [.] will match an actual . and not any character. Also, [^>]* will work as expected, and doesn't collide with end-of-word.
Peter Boughton
very well, this works [^>]* :) many thanks
Giancarlo
Just to clarify - it should be [^<]* as in roe's edit - [^>]* will work but (I think) will first consume most of the closing tag before backtracking out, which performs worse.
Peter Boughton
'Fraid you're way off about < and \<. In some flavors, \< means start-of-word and \> means end-of-word, but Perl, Java, and most other flavors use \b to mean start-or-end-of-word. Escaped or not, an angle bracket just matches an angle bracket. http://www.regular-expressions.info/wordboundaries.html
Alan Moore
You're correct that \b means word boundary in Perl and Java; \< and \> are POSIX regex (what I meant by 'standard', but maybe I'm getting too old). You're right about Perl though, I must've confused it with grouping brackets (which are escaped brackets in POSIX, but not in Perl).
roe
+1  A: 

I can't see why the &#xA; itself would cause any issue - not unless it's getting converted to an actual newline at some point.

If this is the case, you need to enable DOTALL mode, so that the . matches newline also (which it doesn't by default).

To enable DOTALL, simply start the expression with (?s)
(if you created a Pattern object, you could also pass the flag in to that.)

Anyway, try this:

String regex =  "(?s)(?<=<(My:tag)>).*?(?=</\\1>)";
String new_string = old_string.replaceAll(regex,"new_stuff");


You can also enable it for a specific part of a regex with (?s:regex-segment) for example:

String regex =  "(?<=<(My:tag)>)(?s:.*?)(?=</\\1>)";
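
For the Pattern-object variant mentioned above, a minimal sketch might look like the following. It passes `Pattern.DOTALL` as a compile flag instead of using the inline (?s), and for simplicity it uses plain capturing groups with `$1`/`$2` in the replacement rather than the lookarounds shown above. The class name `DotallReplace` and the helper `replaceTagContent` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallReplace {

    // DOTALL is passed as a compile flag, so '.' also matches line terminators.
    private static final Pattern TAG =
            Pattern.compile("(<My:tag>).*?(</My:tag>)", Pattern.DOTALL);

    // Hypothetical helper: keeps the tags (groups 1 and 2) and swaps the text between them.
    static String replaceTagContent(String xml, String replacement) {
        return TAG.matcher(xml)
                .replaceAll("$1" + Matcher.quoteReplacement(replacement) + "$2");
    }

    public static void main(String[] args) {
        String oldString = "<My:tag>value_1 22\nvalue_2 54\nvalue_3 11</My:tag>";

        // Without DOTALL the dot stops at the first newline, so nothing matches
        // and the string comes back unchanged.
        String unchanged = oldString.replaceAll("(<My:tag>).*?(</My:tag>)", "$1new_stuff$2");
        System.out.println(unchanged.equals(oldString)); // prints "true"

        System.out.println(replaceTagContent(oldString, "new_stuff"));
        // prints "<My:tag>new_stuff</My:tag>"
    }
}
```

Compiling the pattern once is mainly worth it if the replacement runs many times; for a one-off call the inline (?s) form is just as good.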
Peter Boughton
Yes, if the problem description is accurate, the entities have to be getting replaced with linefeeds before the regex ever gets applied. Also, you should have been using a non-greedy dot-star (.*?) all along, but that's even more important when you do the match in DOTALL mode.
Alan Moore
That's a little ambiguous; what I meant was that GIANCARLO should have been using the non-greedy dot-star like Peter did here.
Alan Moore