Speaking strictly to the replacement problem, my preferred solution is one enabled by a feature that should probably be available in the upcoming Scala 2.8, which is the ability to replace regex patterns using a function. Using it, the problem can be reduced to this:
import scala.util.matching.Regex
def replaceRegex(input: String, values: IndexedSeq[String]) =
  """\$(\d+)""".r.replaceAllIn(input, _ match {
    // in released Scala versions the method is named replaceAllIn
    case Regex.Groups(index) => values(index.toInt)
  })
Which reduces the problem to what you actually intend to do: replace all $N patterns by the corresponding Nth value of a list.
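For reference, here is a minimal sketch of the same idea against the released API, where the method is `replaceAllIn` taking a `Match => String` function. The names `PlaceholderReplaceSketch` and `replacePlaceholders` are made up for illustration, and I'm assuming zero-based placeholders as in the snippet above. The one addition is `Matcher.quoteReplacement`, because the string returned by the function is still interpreted as a replacement pattern, so a value containing `$` would otherwise be misread.

```scala
import java.util.regex.Matcher

object PlaceholderReplaceSketch {
  // Matches "$" followed by one or more digits; group 1 captures the digits.
  private val placeholder = """\$(\d+)""".r

  // Assumption: placeholders are zero-based, i.e. "$0" is values(0).
  def replacePlaceholders(input: String, values: IndexedSeq[String]): String =
    placeholder.replaceAllIn(
      input,
      // quoteReplacement escapes "$" and "\" so the value is inserted literally
      m => Matcher.quoteReplacement(values(m.group(1).toInt))
    )
}

// Usage:
// PlaceholderReplaceSketch.replacePlaceholders(
//   "id > $0 and name like $1", Vector("42", "'a$b'"))
// returns "id > 42 and name like 'a$b'"
```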
Or, if you can actually set the standards for your input string, you could do it like this:
"select col1 from tab1 where id > %1$s and name like %2$s" format ("one", "two")
If that's all you want, you can stop here. If, however, you are interested in how to go about solving such problems in a functional way, absent clever library functions, please do continue reading.
Thinking functionally about it means thinking of the function. You have a string, some values, and you want a string back. In a statically typed functional language, that means you want something like this:
(String, List[String]) => String
If one considers that those values may be used in any order, we may ask for a type better suited for that:
(String, IndexedSeq[String]) => String
That should be good enough for our function. Now, how do we break down the work? There are a few standard ways of doing it: recursion, comprehension, folding.
RECURSION
Let's start with recursion. Recursion means dividing the problem into a first step, and then repeating it over the remaining data. To me, the most obvious division here would be the following:
- Replace the first placeholder
- Repeat with the remaining placeholders
That is actually pretty straightforward to do, so let's get into further details. How do I replace the first placeholder? One thing that can't be avoided is that I need to know what that placeholder is, because I need to get the index into my values from it. So I need to find it:
(String, Pattern) => String
Once found, I can replace it on the string and repeat:
val stringPattern = "\\$(\\d+)"
val regexPattern = stringPattern.r
def replaceRecursive(input: String, values: IndexedSeq[String]): String = regexPattern findFirstIn input match {
  // findFirstIn returns an Option, so the match has to be unwrapped with Some
  case Some(regexPattern(index)) => replaceRecursive(input replaceFirst (stringPattern, values(index.toInt)), values)
  case _ => input // no placeholder found, finished
}
That is inefficient, because it repeatedly produces new strings, instead of just concatenating each part. Let's try to be more clever about it.
To build a string efficiently through concatenation, we need to use StringBuilder. We also want to avoid creating new strings. StringBuilder can accept a CharSequence, which we can get from a String. I'm not sure whether a new string is actually created or not. If it is, we could roll our own CharSequence that acts as a view into the String, instead of creating a new String. Assured that we can easily change this if required, I'll proceed on the assumption that it is not.
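On the question of whether a new string gets created: the Javadoc describes `String.subSequence(begin, end)` as behaving exactly like `substring(begin, end)`, so it does hand back a `String`. One way around that, without writing a custom `CharSequence`, is the `append(CharSequence, int, int)` overload on `java.lang.StringBuilder`, which copies a range straight into the builder. The sketch below assumes you are happy to use the Java builder directly rather than Scala's wrapper. `AppendRangeSketch`, `appendRange` and `demo` are names I made up.

```scala
object AppendRangeSketch {
  // Copies the characters input[start, end) into the builder,
  // without materializing an intermediate String for the slice.
  def appendRange(sb: java.lang.StringBuilder, input: String, start: Int, end: Int): java.lang.StringBuilder =
    sb.append(input, start, end) // java.lang.StringBuilder.append(CharSequence, int, int)

  def demo(): String = {
    val input = "select $1 from tab"
    val sb = new java.lang.StringBuilder(input.length)
    appendRange(sb, input, 0, 7)            // "select "
    sb.append("col")                        // the replacement value
    appendRange(sb, input, 9, input.length) // " from tab"
    sb.toString
  }
}

// AppendRangeSketch.demo() returns "select col from tab"
```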
So, let's consider what functions we need. Naturally, we'll want a function that returns the index into the first placeholder:
String => Int
But we also want to skip any part of the string we have already looked at. That means we also want a starting index:
(String, Int) => Int
There's one small detail, though. What if there's no further placeholder? Then there wouldn't be any index to return. Java overloads the returned index to signal that case (indexOf returns -1). When doing functional programming, however, it is always best to return what you mean. And what we mean is that we may return an index, or we may not. The signature for that is this:
(String, Int) => Option[Int]
Let's build this function:
def indexOfPlaceholder(input: String, start: Int): Option[Int] = if (start < input.length) {
  input indexOf ("$", start) match {
    case -1 => None
    case index =>
      // only "$" followed by a digit counts as a placeholder
      if (index + 1 < input.length && input(index + 1).isDigit)
        Some(index)
      else
        indexOfPlaceholder(input, index + 1)
  }
} else {
  None
}
That's rather complex, mostly to deal with boundary conditions, such as index being out of range, or false positives when looking for placeholders.
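To make those boundary conditions concrete, here is the function restated inside an object (so the snippet is self-contained), followed by the cases it has to handle. The wrapper name `IndexOfPlaceholderSketch` is mine.

```scala
object IndexOfPlaceholderSketch {
  // Returns the absolute index of the next "$" that is followed by a digit,
  // searching from `start`, or None if there is no such placeholder.
  def indexOfPlaceholder(input: String, start: Int): Option[Int] =
    if (start < input.length) {
      input.indexOf("$", start) match {
        case -1 => None
        case index =>
          if (index + 1 < input.length && input(index + 1).isDigit) Some(index)
          else indexOfPlaceholder(input, index + 1) // false positive, keep looking
      }
    } else None
}

// Usage:
// indexOfPlaceholder("a $1 b", 0)        == Some(2)   // ordinary placeholder
// indexOfPlaceholder("cost $ and $2", 0) == Some(11)  // lone "$" is skipped
// indexOfPlaceholder("ends with $", 0)   == None      // "$" at the very end
// indexOfPlaceholder("a $1 b", 3)        == None      // nothing after start
```

Note that the returned index is always relative to the whole input, not to `start`.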
To skip the placeholder, we'll also need to know its length, with signature (String, Int) => Int:
def placeholderLength(input: String, start: Int): Int = {
def recurse(pos: Int): Int = if (pos < input.length && input(pos).isDigit)
recurse(pos + 1)
else
pos
recurse(start + 1) - start // start + 1 skips the "$" sign
}
Next, we also want to know what, exactly, is the index of the value the placeholder is standing for. The signature for this is a bit ambiguous:
(String, Int) => Int
The first Int is an index into the input, while the second is an index into the values. We could do something about that, but not that easily or efficiently, so let's ignore it. Here's an implementation for it:
def indexOfValue(input: String, start: Int): Int = {
def recurse(pos: Int, acc: Int): Int = if (pos < input.length && input(pos).isDigit)
recurse(pos + 1, acc * 10 + input(pos).asDigit)
else
acc
recurse(start + 1, 0) // start + 1 skips "$"
}
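We could have used the length too, and achieved a simpler implementation. Note that here start must point at the first digit (not at the "$"), and length must count only the digits: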
def indexOfValue2(input: String, start: Int, length: Int): Int = if (length > 0) {
input(start + length - 1).asDigit + 10 * indexOfValue2(input, start, length - 1)
} else {
0
}
As a note, using curly brackets around simple expressions, such as above, is frowned upon by conventional Scala style, but I use it here so it can be easily pasted into the REPL.
So, we can get the index to the next placeholder, its length, and the index of the value. That's pretty much everything needed for a more efficient version of replaceRecursive:
def replaceRecursive2(input: String, values: IndexedSeq[String]): String = {
  val sb = new StringBuilder(input.length)
  def recurse(start: Int): String = if (start < input.length) {
    indexOfPlaceholder(input, start) match {
      case Some(placeholderIndex) =>
        // named "length" so it does not shadow the placeholderLength function
        val length = placeholderLength(input, placeholderIndex)
        sb.append(input subSequence (start, placeholderIndex))
        sb.append(values(indexOfValue(input, placeholderIndex)))
        recurse(placeholderIndex + length) // placeholderIndex is already absolute
      case None =>
        sb.append(input subSequence (start, input.length)) // copy the remaining tail
        sb.toString
    }
  } else {
    sb.toString
  }
  recurse(0)
}
Much more efficient, and as functional as one can be using StringBuilder.
COMPREHENSION
Comprehensions, at their most basic level, mean transforming T[A] into T[B] given a function A => B. It's a monad thing, but it can be easily understood when it comes to collections. For instance, I may transform a List[String] of names into a List[Int] of name lengths through a function String => Int which returns the length of a string. That's a list comprehension.
There are other operations that can be done through comprehensions, given functions with signatures A => T[B] or A => Boolean.
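A small sketch of those three shapes on a plain List may help. The names here (`ComprehensionSketch`, `names`, and so on) are only for illustration.

```scala
object ComprehensionSketch {
  val names: List[String] = List("Ann", "Bob", "Charlotte")

  // A => B: a for/yield over one generator is a map
  val lengths: List[Int] = for (name <- names) yield name.length

  // A => Boolean: a guard filters the elements
  val shortNames: List[String] = for (name <- names if name.length <= 3) yield name

  // A => T[B]: a second generator is a flatMap
  val letters: List[Char] = for (name <- names; c <- name.toList) yield c
}

// ComprehensionSketch.lengths    == List(3, 3, 9)
// ComprehensionSketch.shortNames == List("Ann", "Bob")
// ComprehensionSketch.letters    == "AnnBobCharlotte".toList
```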
That means we need to see the input string as a T[A]. We can't use Array[Char] as input because we want to replace the whole placeholder, which is larger than a single char. Let's propose, therefore, this type signature:
(List[String], String => String) => String
Since the input we receive is a String, we need a function String => List[String] first, which will divide our input into placeholders and non-placeholders. I propose this:
val regexPattern2 = """((?:[^$]+|\$(?!\d))+)|(\$\d+)""".r
def tokenize(input: String): List[String] = regexPattern2.findAllIn(input).toList
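To see what that regex actually produces, here is the same tokenizer wrapped in an object (the name `TokenizeSketch` is mine) with a couple of inputs. The first alternative swallows runs of non-`$` text plus any `$` not followed by a digit; the second matches a placeholder.

```scala
object TokenizeSketch {
  val regexPattern2 = """((?:[^$]+|\$(?!\d))+)|(\$\d+)""".r

  // Splits the input into alternating plain-text and placeholder tokens.
  def tokenize(input: String): List[String] = regexPattern2.findAllIn(input).toList
}

// TokenizeSketch.tokenize("select $1 from $2") == List("select ", "$1", " from ", "$2")
// TokenizeSketch.tokenize("cost: $ and $1")    == List("cost: $ and ", "$1")
```

Concatenating the tokens gives back the original input, which is what lets mkString finish the job later.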
Another problem we have is that we got an IndexedSeq[String], but we need a String => String. There are many ways around that, but let's settle on this (note that it treats placeholders as one-based, so $1 maps to values(0)):
def valuesMatcher(values: IndexedSeq[String]): String => String = (input: String) => values(input.substring(1).toInt - 1)
We also need a function List[String] => String, but List's mkString does that already. So there's little left to do aside from composing all this stuff:
def comprehension(input: List[String], matcher: String => String) =
for (token <- input) yield (token: @unchecked) match {
case regexPattern2(_, placeholder: String) => matcher(placeholder)
case regexPattern2(other: String, _) => other
}
I use @unchecked because there shouldn't be any pattern other than these two above, if my regex pattern was built correctly. The compiler doesn't know that, however, so I use that annotation to silence the warning it would produce. If an exception is thrown, there's a bug in the regex pattern.
The final function, then, unifies all that:
def replaceComprehension(input: String, values: IndexedSeq[String]) =
comprehension(tokenize(input), valuesMatcher(values)).mkString
One problem with this solution is that I apply the regex pattern twice: once to break up the string, and the other to identify the placeholders. Another problem is that the List of tokens is an unnecessary intermediate result. We can solve that with these changes:
def tokenize2(input: String): Iterator[List[String]] = regexPattern2.findAllIn(input).matchData.map(_.subgroups)
def comprehension2(input: Iterator[List[String]], matcher: String => String) =
for (token <- input) yield (token: @unchecked) match {
case List(_, placeholder: String) => matcher(placeholder)
case List(other: String, _) => other
}
def replaceComprehension2(input: String, values: IndexedSeq[String]) =
comprehension2(tokenize2(input), valuesMatcher(values)).mkString
FOLDING
Folding is a bit similar to both recursion and comprehension. With folding, we take a T[A] input that can be comprehended, a B "seed", and a function (B, A) => B. We comprehend the list using the function, always taking the B that resulted from the last element processed (the first element takes the seed). Finally, we return the result of the last comprehended element.
I'll admit I could hardly have explained it in a more obscure way. That's what happens when you try to stay abstract. I explained it that way so that the type signatures involved become clear. But let's just see a trivial example of folding to understand its usage:
def factorial(n: Int) = {
val input = 2 to n
val seed = 1
val function = (b: Int, a: Int) => b * a
input.foldLeft(seed)(function)
}
Or, as a one-liner:
def factorial2(n: Int) = (2 to n).foldLeft(1)(_ * _)
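The similarity to recursion can be made concrete by writing a left fold by hand. This is only a sketch over List to show the shape, not how the standard library implements it; `FoldSketch`, `foldLeftSketch` and `factorial3` are names I made up.

```scala
import scala.annotation.tailrec

object FoldSketch {
  // A hand-written left fold: the seed is the accumulator, and each step
  // feeds the previous result plus the next element into the function.
  @tailrec
  def foldLeftSketch[A, B](input: List[A], seed: B)(function: (B, A) => B): B = input match {
    case Nil          => seed // nothing left: the accumulator is the result
    case head :: tail => foldLeftSketch(tail, function(seed, head))(function)
  }

  def factorial3(n: Int): Int = foldLeftSketch((2 to n).toList, 1)(_ * _)
}

// FoldSketch.factorial3(5) == 120
// FoldSketch.foldLeftSketch(List("a", "b"), "")(_ + _) == "ab"
```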
Ok, so how would we go about solving the problem with folding? The result, of course, should be the string we want to produce. Therefore, the seed should be an empty string (here, an empty StringBuilder). Let's use the result from tokenize2 as the comprehensible input, and do this:
def replaceFolding(input: String, values: IndexedSeq[String]) = {
val seed = new StringBuilder(input.length)
val matcher = valuesMatcher(values)
val foldingFunction = (sb: StringBuilder, token: List[String]) => {
token match {
case List(_, placeholder: String) => sb.append(matcher(placeholder))
case List(other: String, _) => sb.append(other)
}
sb
}
tokenize2(input).foldLeft(seed)(foldingFunction).toString
}
And, with that, I finish showing the most usual ways one would go about this in a functional manner. I have resorted to StringBuilder because concatenation of String is slow. If that weren't the case, I could easily replace StringBuilder in the functions above with String. I could also convert the Iterator into a Stream, and completely do away with mutability.
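For the curious, here is roughly what the String-seeded version of the fold might look like. It is a sketch, not a recommendation: it trades the StringBuilder for plain concatenation, and it still consumes an Iterator internally. `ImmutableFoldSketch` and `replaceFoldingImmutable` are hypothetical names, and the one-based lookup follows the convention of valuesMatcher.

```scala
object ImmutableFoldSketch {
  val regexPattern2 = """((?:[^$]+|\$(?!\d))+)|(\$\d+)""".r

  // Folds the tokens into a String seed instead of a StringBuilder.
  // Assumption: one-based placeholders, so "$1" maps to values(0).
  def replaceFoldingImmutable(input: String, values: IndexedSeq[String]): String =
    regexPattern2.findAllIn(input).matchData.map(_.subgroups).foldLeft("") {
      // a group that did not participate is null, which a typed pattern rejects
      case (acc, List(_, placeholder: String)) => acc + values(placeholder.substring(1).toInt - 1)
      case (acc, List(other: String, _))       => acc + other
      case (acc, _)                            => acc // not expected if the regex is right
    }
}

// ImmutableFoldSketch.replaceFoldingImmutable("select $1 from $2", Vector("col1", "tab1"))
// returns "select col1 from tab1"
```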
This is Scala, though, and Scala is about balancing needs and means, not about purist solutions. Though, of course, you are free to go purist. :-)