ansaurus

Question

Combinatorics : Grouping Characters Challenges

Answer 1

+16 A:

Computing the number of combinations is trivial. Basically, you have a character between each two letters. It might be "epsilon" (empty), or space. So, for n letters you have n-1 such separator characters. As each character can only have two values, that's the same as a binary number of n-1 digits. So you can have 2 to the power of n-1 combinations.

aids = 4 characters (n = 4)
n - 1 = 3
2 ** (n-1) = 8 combinations

Now, for the special cases. If each character can be either lowercase or uppercase, that's 2 to the power of n variations for the characters (as long as they are letters). Now, EACH combination above has 2**n variations based on capitalization, so the end result is (2**(n-1)) * (2**n), which is equal to 2 ** (2*n -1).

For each aditional letter you either double or quadruple (depending on the capitalization thing) the time taken to enumerate them, as can be easily understood from the formula.

The total number of characters is a bit more difficult, but suffice to notice that each "space" is "epsilon" half the time. So we have (2 ** (n-1)) * n (letters) + ((2 ** (n-1)) * (n-1)) / 2 (spaces). In the example:

(2 ** 3) * 4 = 32 characters
(2 ** 3) * 3 / 2 = 12 spaces
Total = 44 characters

Finally, the distance problem is related to the Levenshtein distance. I thought about using a Burkhard-Keller tree, but I see now this is not necessary at all, as the problem is rather simpler.

First, the minimum number of inserts/deletions/changes necessary to make one string equal to another is called the Levenshtein distance. This is directly applicable to the problem: you add spaces, delete spaces, change lowercase/uppercase as needed. Usually, this is best solved with Dynamic Programming, which can be generally thought of as keeping data about the solution to small parts of the problem, which then get reused computing other parts/bigger parts.

But given the particular restrictions of our problem, we can just compare the "binary" numbers representing spaces/epsilon.

Say a function f(word) will return the binary number representing spaces in that word. For example, it will return "010" for "ai ds" and "111" for "a i d s". The number of changes between each combination is given by XORing the results of f (010 xor 111 = 101), and then counting the number of bits equal to 1. Let's write a couple of functions in Scala for that:

def spacesToBinary(s: String) = {
  (s zip s.drop(1)) // transform "ai ds" into ((a,i), (i, ), ( ,d), (d, s))
  .foldLeft(0) { (binary, pair) => pair match {
      case (' ', _) => binary            // A space is always followed by a letter,
                                         // so we ignore it
      case (_, ' ') => (binary << 1) | 1 // If a letter is followed by a space, bit = 1
      case _ => binary << 1              // If not, bit = 0
  }}
}

def numberOfOnes(d: Int) = {
  var bits = 0
  var n = d
  while(n != 0) {
    if ((n & 1) == 1)
      bits += 1
    n >>= 1
  }
  bits
} 

def spaceDistance(s1: String, s2: String) = 
  numberOfOnes(spacesToBinary(s1) ^ spacesToBinary(s2))

Now, comparing lowercase/uppercase is quite easy, once we discard spaces:

def caseDistance(s1: String, s2: String) = {
  val ns1 = s1 filter (_ != ' ')
  val ns2 = s2 filter (_ != ' ')
  (ns1 zip ns2).foldLeft(0){(acc, pair) => acc + (if (pair._1 == pair._2) 0 else 1)}
}

So, the Levenshtein distance is:

def levenshtein(s1: String, s2: String) = spaceDistance(s1, s2) + caseDistance(s1, s2)

We also know the following properties about the Levenshtein distance. Say d(x,y) is the Levenshtein distance between x and y. Then we know:

d(x, y) = 0 if, and only if, x = y
d(x, y) = d(y, x)
d(x, y) + d(y, z) >= d(x, z)

The last criteria i known as the triange inequality. Simply put, the path from x to z is at least as small as the path from x to y plus the path from y to z (think a triangle with vertices x, y and z).

Ok, so let's think about the bonus questions.

How many paths there are between two words? Well, if the Levenshtein distance is n, that means you have "n" unique modifications, a1, a2, ..., an. For each different ordering of these modifications you have a path. The number of permutations of "n" elements is the factorial of "n" (or n!):

def factorial(n: Int): Int = n match {
  case 0 => 1
  case _ => n * factorial(n-1)
}

2! = 2
3! = 6
4! = 24
5! = 120
etc

Is there a "central" combination? Actually, no. If we go back and think of combinations as pairs of binary numbers representing the spaces/no-spaces and lower/uppercases, then it should be obvious that you can simply invert the bits to produce a new combination whose distance to the one chosen is the maximum possible.

Or, in other words, for every combination of n letters, there is one, and only one, corresponding combination, so that the Levenshtein distance between the two combinations is 2*n - 1, the maximum possible distance between any two combinations.

I see I forgot to compute all combinations whose (minimum) distance to s is n. Well, we know each combination can be represented as two binary numbers, representing the spaces and the capitalization of each letter, the first being n-1 digits long, and the second n digits long.

We also know that we can simply invert any of these "digits" to get a difference. So, if we get one big binary number 2*n - 1 digits long, and we enumerate all its digits, the combinations at a minimum distance d from it are given by the combination of the 2*n-1 digits in groups of size "d" with no repetitions. For N=2*n-1, the number of such combination is N!/((N-d)! * d!).

For example, for distance 2 and "aids", n=4, d=2, N=2*4-1=7, and the number of combinations is 7!/(5!*2!) = 7 * 6 / 2 = 21.

We can compose the combinations this way:

def computeCombinations(n: Int, d: Int): List[List[Int]] = {
  def gen(d: Int, l: List[Int]): List[List[Int]] = d match {
    case 0 => List(List())
    case _ => for(el <- l;
                  sl <- gen(d-1, l.dropWhile(_ != el).tail))
              yield el :: sl
  }

  gen(d, (0 until n).toList)
}

This will return lists of lists of letter/spaces to change. We indicate which letter or space needs changing through the number of the bit we want to toggle. To simplify things, we assume the binary number for the capitalization and the binary number for the space/no-space are concatenated in a single binary number.

Next, we have to find a way to produce the variations from this information. The upper/lowercase is simple, assuming we receive the word without spaces:

def toggleCharCase(c: Char) = (c ^ ('A' ^ 'a')).toChar
def toggleCases(s: String, bits: List[Int]) = (
  s.zipWithIndex
  map (t => if (Set(bits: _*) contains t._2) (toggleCharCase(t._1), t._2) else t)
  map (_._1)
)

Toggling spaces is more difficult. We'll use spacesToBinary, defined above, convert that into a list of bit numbers which are set, toggle the requested bits and return that. Afterwards, we use a different function to insert the spaces in the proper places:

def toggleSpaces(s: String, bits: List[Int]): List[Int] = {
  val before = spacesToBinary(s)
  val beforeBits = Set() ++ 
                   (for(i <- 0 to 30; // Assuming 32 bits, 1 for sign
                        if (scala.Math.pow(2, i).toInt & before) != 0)
                    yield i)
  val afterBits = (beforeBits union Set(bits: _*)) diff 
                  (beforeBits intersect Set(bits : _*))
  afterBits.toList
}

def addSpaces(s: String, bits: List[Int]): String = (
  s.filter(_ != ' ').zipWithIndex 
  map (t => t._1.toString + (if (bits contains t._2) " " else ""))
  mkString
)

Now we have to compute the space difference, remove the spaces, toggle cases, and then add the spaces back. Let's see:

def changeCombination(s: String, n: Int, bits: List[Int]) = {
  // Split the binary number with all changes into case changes and space changes
  val spaceBits = bits filter (_ >= n) map (_ - n)
  val caseBits = bits filter (_ < n)

  // Compute the set of spaces after changing them
  val newSpaces = toggleSpaces(s, spaceBits)

  // Remove spaces and toggle case
  val noSpaces = s filter (_ != ' ')
  val caseChanged = toggleCases(noSpaces, caseBits).mkString

  // Now add the spaces as computed before
  val spacesAdded = addSpaces(caseChanged, newSpaces).mkString
  spacesAdded
}

Finally, we compute all combinations, and then produce a modified string for each one:

def makeCombinations(s: String, d: Int) = {
  val n = (s filter (_ != ' ')).length
  for(combination <- computeCombinations(n*2-1, d))
  yield changeCombination(s, n, combination)
}

This code has all been tested on Scala 2.8 (except for some renaming I just made). Here is the result of a run:

scala> makeCombinations("ai ds", 2) foreach println
AI ds
Ai Ds
Ai dS
A i ds
Aids
Ai d s
aI Ds
aI dS
a I ds
aIds
aI d s
ai DS
a i Ds
aiDs
ai D s
a i dS
aidS
ai d S
a ids
a i d s
aid s

Daniel 2009-07-14 18:53:57

wow i didnt expect such huge numbers but your approaches is really awesome. i wish i can think like this someday.

kunjaan 2009-07-14 19:05:58

Daniel could you explain the BK tree implementation for this problem? And could you comment on the Bonus section?

kunjaan 2009-07-14 19:27:12

Well... ok, let's see what I can do.

Daniel 2009-07-14 19:33:42

I am speechless. How did you come up with that? Seriously. I am really interested to know the thought process. How did you decide that "Levenshtein distance" would be appropriate? Anyway Daniel, you owned the thread. Thanks a LOT.

kunjaan 2009-07-14 21:22:49

And by the way, I have never heard someone say "Let's write a couple of functions in Scala for that:" : )

kunjaan 2009-07-14 21:24:03

It just so happens that I knew B-K trees. You asked for more explanation, so I went to the link I posted, for a reference, and saw that Levenshtein difference was enough.

Daniel 2009-07-14 22:07:44

Answer 2

+1 A:

http://www-cs-faculty.stanford.edu/~knuth/fasc3b.ps.gz (download Ghostscript/Ghostview if you can't view postscript) discusses partitions in detail.

For a sequence of length n, there are 2^(n-1) partitions. Think of a bit between each pair of consecutive items. If the bit is set, then they're separated (by a space, as you listed them). "aids" (length 4) has 2^3 possible partitions.

In reply to your other questions:

Time to enumerate: O(n*2^n) - constant in the length of the output. Not only does the number of items increase with input length, but the number of characters in each item increases as well.

Number of characters written: let's not count newlines (if you do, add another 2^(n-1) characters). Then you have n*2^(n-1) nonspace characters, plus the number of 1s in all the unique n-1 digit bit strings. The k digit bit strings, when written out, have k*2^k bits, and half of those are 1. So the total number of characters is [n+(n-1)/2]2^(n-1)), not counting newlines. In your list of the 8 variations on "aids", there are 32 nonspace characters, and 12 spaces - 42^3, and (3/2)*2^3 respectively.

Edit distance: you'll have to be more precise about the transformations and their cost. By "word" I assume you mean a single partition (one of your 8 example lines). If an edit is the removal or addition of a single space, then you're talking about Hamming distance on n-1 digit bit strings.

wrang-wrang 2009-07-14 18:56:32

please see my new edit. thanks.

kunjaan 2009-07-14 19:18:03

Answer 3

+2 A:

As the other answers already said, there are 2^(n-1) possiblities for point 1. About some of the special cases (point 2):

What if the case made a difference like Ai sd and ai sd are two groups?

Well, in that case you had 2^n different case combinations, so you had 2^n * 2^(n-1) = 2^(2 * n - 1) possibilities at all.

If you take the "blank space" as a character. After all the enumeration, how many characters would you have written?

That's a more interesting question. You have 1 possibility to place no space, 3 possibilities to place 1 space, 3 possibilites to place 2 spaces and 1 possibility to place 3 spaces. This is a binomial distribution, if I recall correctly, and there are formulas to calculate the numbers. You can also use Pascal's triangle for this:

   1
  1 1
 1 2 1
1 3 3 1 <- your case (4th row)
  ...

After you got these numbers, calculate the total character count like this:

1*4 + 3*5 + 3*6 + 1*7 = 44

schnaader 2009-07-14 19:06:23

Answer 4

+1 A:

A simple algorithm for visiting each of the words within distance k or less: using a hash table to visit each bitstring just once (or an array of 2^(n-1) bits, but that may be too big), recursively visit each of the adjacent single-edit differences (assuming Hamming distance: for i from 1..(n-1), XOR 2^i with the source bitstring, toggling the ith bit).

Do this to a depth of k (the depth is passed along with your recursion), and you'll have visited all edits within distance k. Of course, if you want just those of exactly depth k, you'll want to use breadth first search: instead of immediately visiting each neighbor, keep them in a queue to be visited. While visiting the queue for items of a given generation (j) (all having the same best edit distance), queue future items in a different queue for the next generation (j+1). This way you visit each string first using the least possible number of edits (breadth first = best first when each transition has the same cost).

If you don't want to do breadth first search, then you can always compute the set of words within k or less, and the set within k-1 or less, and take the difference (you'd use two separate tables). This is effectively "iterative deepening search".

B-K trees aren't appropriate here unless you're considering an unstructured set of words (a general dictionary). We know exactly the structure of the partitions for a single word already.

wrang-wrang 2009-07-14 19:23:36

Could you try the bonus questions? Is it clear enough? I was thinking of the questions as I was typing so I hope they are clear.

kunjaan 2009-07-14 19:26:38

Answer 5

A:

The counting arguments are right.

There's a general way I program problems like this, using branch-and-bound. Here's an example.

Basically, rather than write a loop to scan the string, you write a recursive function, and keep track of cost as one of its arguments. Then at each step, you can 1) step down the string, for an additional cost of zero, then 2) make some small change to the string, add an increment to the cost, and then step forward, and 3) repeat 2 for as many different kinds of changes you want to consider.

Then have an overall cost budget, and refuse to take any branch where the cost would exceed the budget.

Finally, as an outer loop, do the whole thing once with a budget of 0. If that doesn't produce any matches, do it again with a cost of 1, and so forth until you get one or more matches.

Mike Dunlavey 2009-07-14 19:35:59

ansaurus

tags:

views:

answers:

Combinatorics : Grouping Characters Challenges

related questions