In Java, and apparently in a few other languages, a backreference in the pattern is preceded by a backslash (e.g. \1, \2, \3, etc), but in a replacement string it is preceded by a dollar sign (e.g. $1, $2, $3, and also $0).

Here's a snippet to illustrate:

System.out.println(
    "left-right".replaceAll("(.*)-(.*)", "\\2-\\1") // WRONG!!!
); // prints "2-1"

System.out.println(
    "left-right".replaceAll("(.*)-(.*)", "$2-$1")   // CORRECT!
); // prints "right-left"

System.out.println(
    "You want million dollar?!?".replaceAll("(\\w*) dollar", "US\\$ $1")
); // prints "You want US$ million?!?"

System.out.println(
    "You want million dollar?!?".replaceAll("(\\w*) dollar", "US$ \\1")
); // throws IllegalArgumentException: Illegal group reference

Questions:

  • Is the use of $ for backreferences in replacement strings unique to Java? If not, what language started it? What flavors use it and what don't?
  • Why is this a good idea? Why not stick to the same syntax as in the pattern? Wouldn't that lead to a more cohesive and easier-to-learn language?
    • Wouldn't the syntax be more streamlined if statements 1 and 4 in the above were the "correct" ones instead of 2 and 3?
+3  A: 

Is the use of $ for backreferences in replacement strings unique to Java?

No. Perl uses it, and Perl certainly predates Java's Pattern class. Java's regex support is explicitly described in terms of Perl regexes.

For example: http://perldoc.perl.org/perlrequick.html#Search-and-replace

Why is this a good idea?

Well obviously you don't think it is a good idea! But one reason that it is a good idea is to make Java search/replace support (more) compatible with Perl's.

There is another possible reason why $ might have been viewed as a better choice than \. That is that \ has to be written as \\ in a Java String literal.
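To make the escaping point concrete, here is a minimal sketch (the class and method names are made up for illustration). It only shows how the two kinds of backreference look in Java source; it says nothing about what the designers actually intended.

```java
public class BackrefSyntaxDemo {
    // Pattern backreference: the regex \1 has to be typed as "\\1" in a Java literal.
    static boolean hasDoubledWord(String s) {
        return s.matches(".*\\b(\\w+) \\1\\b.*");
    }

    // Replacement backreference: $1 and $2 need no extra escaping in a Java literal.
    static String swapAroundHyphen(String s) {
        return s.replaceAll("(\\w+)-(\\w+)", "$2-$1");
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledWord("this is is a test")); // prints "true"
        System.out.println(swapAroundHyphen("left-right"));      // prints "right-left"
    }
}
```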

But all of this is pure speculation. None of us were in the room when the design decisions were made. And ultimately it doesn't really matter why they designed the replacement String syntax that way. The decisions have been made and set in concrete, and any further discussion is purely academic ... unless you just happen to be designing a new language or a new regex library for Java.

Stephen C
+1 agreed... a lot of regular expression engines these days do things the way they do because Perl did it that way. So to really understand it, you'd have to understand the reasoning behind Perl. (Warning: DO NOT TRY THAT AT HOME)
David Zaslavsky
Perl pwns at regex. You see it everywhere nowadays: JavaScript, XML, Java, PHP, etc, etc.
amphetamachine
*"Perl pwns at regex"* - Would someone care to translate that into English for me?
Stephen C
http://en.wikipedia.org/wiki/Pwn :-)
bkail
+1  A: 

After doing some research, I understand the issues now: Perl had to use different symbols for pattern backreferences and replacement backreferences, and while java.util.regex.* doesn't have to follow suit, it chooses to, for traditional rather than technical reasons.


On the Perl side

(Please keep in mind that all I know about Perl at this point comes from reading Wikipedia articles, so feel free to correct any mistakes I may have made)

The reason why it had to be done this way in Perl is the following:

  • Perl uses $ as a sigil (i.e. a symbol attached to variable name).
  • Perl string literals are variable interpolated.
  • Perl regex actually captures groups as variables $1, $2, etc.

Thus, because of the way Perl is interpreted and how its regex engine works, backreferences in the pattern must be preceded by a backslash (e.g. \1): if the sigil $ were used instead (e.g. $1), it would cause unintended variable interpolation into the pattern.

The replacement string, due to how it works in Perl, is evaluated within the context of every match. It is most natural for Perl to use variable interpolation here, so the regex engine captures groups into variables $1, $2, etc, to make this work seamlessly with the rest of the language.


On the Java side

Java is a very different language from Perl; most importantly here, it has no variable interpolation. Moreover, replaceAll is a method call, and as with all method calls in Java, the arguments are evaluated once, before the method is invoked.

Thus, a variable interpolation feature by itself would not be enough, since in essence the replacement string must be re-evaluated on every match, and that's just not the semantics of method calls in Java. A variable-interpolated replacement string that is evaluated before replaceAll is even invoked is practically useless; the interpolation needs to happen inside the method, on every match.

Since that is not the semantics of the Java language, replaceAll must do this "just-in-time" interpolation manually. As such, there is absolutely no technical reason why $ is the escape symbol for backreferences in replacement strings. It could very well have been \. Conversely, backreferences in the pattern could also have been escaped with $ instead of \, and it would still have worked just as well technically.
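As a rough illustration of what "manual" interpolation means, here is a sketch that hard-codes the effect of the replacement "$2-$1" using a Matcher loop. The class and method names are hypothetical, and this is not the actual implementation of replaceAll, only an approximation of the per-match work it has to do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ManualReplaceDemo {
    // Hypothetical helper: behaves like input.replaceAll("(\\w+)-(\\w+)", "$2-$1"),
    // but reads the captured groups explicitly on every match.
    static String swapAroundHyphen(String input) {
        Matcher m = Pattern.compile("(\\w+)-(\\w+)").matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());  // copy the text before this match
            out.append(m.group(2)).append('-').append(m.group(1)); // groups of THIS match
            last = m.end();
        }
        out.append(input, last, input.length()); // copy the tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(swapAroundHyphen("left-right up-down")); // prints "right-left down-up"
    }
}
```

The group values are looked up inside the loop, once per match, which is exactly the step that cannot happen when the argument is evaluated before the call.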

The reason Java does regex the way it does is purely traditional: it's simply following the precedent set by Perl.

polygenelubricants
Within the regex `$` is already being used as an anchor; using it as the sigil for backreferences would have been very messy, if not impossible. In the replacement string, the backslash is used for disambiguation; if `$10` could refer to the tenth group but you want it to mean the first group followed by zero, you write `$1\0` instead. And of course, you use it to escape a literal `$`. That's consistent with its use both in regexes and in Java string literals. So it wasn't a completely arbitrary choice.
Alan Moore
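A small sketch of the disambiguation described in the comment above, using made-up class and method names. It assumes the documented behaviour of String.replaceAll, where extra digits after $ are consumed only while they still name an existing group.

```java
public class GroupRefDisambiguationDemo {
    // Ten capturing groups, so "$10" can legitimately mean the tenth group.
    static final String TEN_GROUPS = "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)";

    static String tenthGroup(String s) {
        return s.replaceAll(TEN_GROUPS, "$10");
    }

    // The replacement string is $1\0: group 1, then an escaped literal '0'.
    static String firstGroupThenZero(String s) {
        return s.replaceAll(TEN_GROUPS, "$1\\0");
    }

    // The replacement string is \$$1: a literal '$', then group 1.
    static String prefixDollar(String s) {
        return s.replaceAll("(\\d+)", "\\$$1");
    }

    public static void main(String[] args) {
        System.out.println(tenthGroup("abcdefghij"));         // prints "j"
        System.out.println(firstGroupThenZero("abcdefghij")); // prints "a0"
        System.out.println(prefixDollar("costs 5"));          // prints "costs $5"
    }
}
```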