ansaurus

Question

An interview question - Split text into sub-strings according to rules

Answer 1

A:

The result of your computation will be a partitioning of the given text into short sub-strings containing numerics and long substrings not containing numerics. (This much you knew already).

You will essentially be partitioning off short subs around the numerics and then breaking everything else down into long subs as often as needed to fulfill the length criteria.

Your freedom, i.e. what you can manipulate to improve your result, is to select which characters to include with a numeric. If N = 3, then for every numeric you get the choice of XXN, XNX or NXX. If M is 5 and you have 6 characters before your first numeric, you'll want to include at least one of those characters in your short sub so you won't end up with two "long" strings to the left of your "short" one when you could have just one instead.

As a first approximation, I'd go with extending your "short" strings leftwise far enough to avoid redundant "long" strings. This is a typical "greedy" approach, and greedy approaches often yield optimal or almost-optimal results. To do even better than that would not be easy, and I'm not going to try to figure out how to go about that.

Carl Smotricz 2010-07-11 13:43:31

Greedy approximations are pretty trivial to come up with I think. Do you have any advice on how to build an optimal solution using DP?

IVlad 2010-07-11 13:48:15

I think a guaranteed-optimal solution would be a rather unfair thing to ask for in an interview. Off the top of my head, I'd build a tree of the configurations resulting from each possible decision on each numeric location, and work out ways to prune the branches yielding irrelevant or worse-than-other solutions. Given time to prepare, I'd first hit Russell and Norvig's "Artificial Intelligence - A Modern Approach" or some other book that explains this stuff well. With luck, there may be an example of a similar problem.

Carl Smotricz 2010-07-11 14:16:52

Answer 2

+8 A:

Idea

A greedy approach is the way to go:

If the current text is empty, you're done.
Take the first N characters. If any of them is a digit then this is a new substring. Chop it off and go to beginning.
Otherwise, extend the digitless segment to at most M characters. This is a new substring. Chop it off and go to beginning.

Proof

Here's a reductio-ad-absurdum proof that the above yields an optimal solution. Assume there is a better split than the greedy split. Let's skip to the point where the two splits start to differ and remove everything before this point.

Case 1) A digit among the first N characters.

Assume that there is an input for which chopping off the first N characters cannot yield an optimal solution.

Greedy split:   |--N--|...
A better split: |---|--...
                      ^
                      +---- this segment can be shortened from the left side

However, the second segment of the putative better solution can be always shortened from the left side, and the first one extended to N characters, without altering the number of segments. Therefore, a contradiction: this split is not better than the greedy split.

Case 2) No digit among the first K (N < K <= M) characters.

Assume that there is an input for which chopping off the first K characters cannot yield an optimal solution.

Greedy split:   |--K--|...
A better split: |---|--...
                      ^
                      +---- this segment can be shortened from the left side

Again, the the "better" split can be transformed, without altering the number of segments, to the greedy split, which contradicts the initial assumption that there is a better split than the greedy split.

Therefore, the greedy split is optimal. Q.E.D.

Implementation (Python)

import sys

m, n, text = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
textLen, isDigit = len(text), [c in '0123456789' for c in text]

chunks, i, j = [], 0, 0
while j < textLen:
   i, j = j, min(textLen, j + n) 
   if not any(isDigit[i:j]):
      while j < textLen and j - i < m and not isDigit[j]:
         j += 1
   chunks += [text[i:j]]
print chunks

Implementation (Java)

public class SO {
   public List<String> go(int m, int n, String text) {
      if (text == null)
         return Collections.emptyList();
      List<String> chunks = new ArrayList<String>();

      int i = 0;
      int j = 0;
      while (j < text.length()) {
         i = j;         
         j = Math.min(text.length(), j + n);
         boolean ok = true;
         for (int k = i; k < j; k++) 
            if (Character.isDigit(text.charAt(k))) {
               ok = false;              
               break;                   
            }                   
         if (ok)        
            while (j < text.length() && j - i < m && !Character.isDigit(text.charAt(j)))
               j++;                     
         chunks.add(text.substring(i, j));
      }         
      return chunks;
   }    

   @Test
   public void testIt() {
      Assert.assertEquals(
         Arrays.asList("asdas", "d332", "4asd", "fsdxf", "23"),
         go(5, 4, "asdasd3324asdfsdxf23"));
   }    
}

Bolo 2010-07-11 15:49:27

This is absolutely correct solution. (didn't check the code, only idea)

Nikita Rybak 2010-07-11 16:20:27

Ok, I've also posted a proof now :) and +1 for finding a correct solution yourself, btw.

Nikita Rybak 2010-07-11 16:35:05

+1 to both of you, looks right.

IVlad 2010-07-11 16:37:28

An off-topic question here: what does the SO etiquette say about incremental improvements of an answer? My entry was initially a question whether there is a counterexample for a provided algorithm (with working implementation in Python). Later, I've added the proof, Java implementation and removed the question. My edits invalidated some of the comments and potentially some up/down votes. Are incremental edits OK? If not, what is the better course of action.

Bolo 2010-07-11 17:11:42

@Bolo - personally I don't mind incremental improvements. If comments are invalidated, I expect them to be deleted (I deleted mine), and I also expect downvoters to check for improvements on the questions / answers they downvoted. I don't know if there's an "official etiquette" however.

IVlad 2010-07-11 20:25:30

Answer 3

+4 A:

Bolo has provided a greedy algorithm in his answer and asked for a counter-example. Well, there's no counter-example because that's perfectly correct approach. Here's the proof. Although it's a bit wordy, it often happens that proof is longer than algorithm itself :)

Let's imagine we have input of length L and constructed an answer A with our algorithm. Now, suppose there's a better answer B. I.e., B has less segments than A does.

Let's say, first segment in A has length la and in B - lb. la >= lb because we've choosen first segment in A to have maximum possible length. And if lb < la, we can increase length of first segment in B without increasing overall number of segments in B. It would give us some other optimal solution B', having same first segment as A.

Now, remove that first segment from A and B' and repeat operation for length L' < L. Do it until there's no segments left. It means, answer A is equal to some optimal solution.

Nikita Rybak 2010-07-11 16:30:41

ansaurus

tags:

views:

answers:

An interview question - Split text into sub-strings according to rules

Idea

Proof

Implementation (Python)

Implementation (Java)

related questions