views:

69

answers:

4

I am given a string, and a set of rules which select valid substrings by a process which isn't important here. Given an enumeration of all valid substrings, I have to find the optimum set of substrings according to a set of ranked criteria, such as:

  1. Substrings may not overlap
  2. All characters must be part of a substring if possible
  3. Use as few different substrings as possible
  4. etc.

For example, given the string abc and the substrings [a, ab, bc], the optimal set of substrings by the preceding rules is [a, bc].

Currently I'm doing this by a standard naive algorithm of enumerating all possible sets of substrings, then iterating over them to find the best candidate. The problem is that as the length of the string and the number of substrings goes up, the number of possible sets increases exponentially. With 50 substrings (well within possibility for this app), the number of sets to enumerate is 2^50, which is extremely prohibitive.

It seems like there should be a way to avoid generating many of the sets that will obviously be losers, or to algorithmically converge on the optimum set without having to blindly generate every candidate. What options are there?

Note that for this application it may be acceptable to use an algorithm that offers a statistical rather than absolute guarantee, such as an n% chance of hitting a non-optimal candidate, where n is suitably small.

+1  A: 

This smells like a dynamic programming problem. You can find a number of good sources on it, but the gist is that you generate a collection of subproblems, and then build up "larger" optimal solutions by combining optimal subsolutions.

Charlie Martin
+2  A: 

Here's a working solution in Haskell. I have called the unique substrings symbols, and an association of one occurrence of the substrings a placement. I have also interpreted criterion 3 ("Use as few different substrings as possible") as "use as few symbols as possible", as opposed to "use as few placements as possible".

This is a dynamic programming approach; the actual pruning occurs due to the memoization. Theoretically, a smart haskell implementation could do it for you, (but there are other ways where you wrap makeFindBest), I'd suggest using a bitfield to represent the used symbols and just an integer to represent the remaining string. The optimisation is possible from the fact that: given optimal solutions for the strings S1 and S2 that both use the same set of symbols, if S1 and S2 are concatenated then the two solutions can be concatenated in a similar manner and the new solution will be optimal. Hence for each partition of the input string, makeFindBest need only be evaluated once on the postfix for each possible set of symbols used in the prefix.

I've also integrated branch-and-bound pruning as suggested in Daniel's answer; this makes use of an evaluation function which becomes worse the more characters skipped. The cost is monotonic in the number of characters processed, so that if we have found a set of placements that wasted only alpha characters, then we never again try to skip more than alpha characters.

Where n is the string length and m is the number of symbols, the worst case is O(m^n) naively, and m is O(2^n). Note that removing constraint 3 would make things much quicker: the memoization would only need to be parameterized by the remaining string which is an O(n) cache, as opposed to O(n * 2^m)!

Using a string search/matching algorithm such as Aho-Corasick's string matching algorithm, improves the consume/drop 1 pattern I use here from exponential to quadratic. However, this by itself doesn't avoid the factorial growth in the combinations of the matches, which is where the dynamic programming helps.

Also note that your 4th "etc." criteria could possibly change the problem a lot if it constrains the problem in a way that makes it possible to do more aggressive pruning, or requires backtracking!

module Main where

import List
import Maybe
import System.Environment

type Symbol = String
type Placement = String

-- (remaining, placement or Nothing to skip one character)
type Move = (String, Maybe Placement)

-- (score, usedsymbols, placements)
type Solution = (Int, [Symbol], [Placement])

-- invoke like ./a.out STRING SPACE-SEPARATED-SYMBOLS ...
-- e.g. ./a.out "abcdeafghia" "a bc fg"
-- output is a list of placements
main = do
  argv <- System.Environment.getArgs
  let str = head argv
      symbols = concat (map words (tail argv))
  (putStr . show) $ findBest str symbols
  putStr "\n"

getscore :: Solution -> Int
getscore (sc,_,_) = sc

-- | consume STR SYM consumes SYM from the start of STR.  returns (s, SYM)
-- where s is the rest of STR, after the consumed occurrence, or Nothing if
-- SYM isnt a prefix of STR.
consume :: String -> Symbol -> Maybe Move
consume str sym = if sym `isPrefixOf` str
                  then (Just (drop (length sym) str, (Just sym)))
                  else Nothing

-- | addToSoln SYMBOLS P SOL incrementally updates SOL with the new SCORE and
-- placement P
addToSoln :: [Symbol] -> Maybe Placement -> Solution -> Solution
addToSoln symbols Nothing (sc, used, ps) = (sc - (length symbols) - 1, used, ps)
addToSoln symbols (Just p) (sc, used, ps) = 
  if p `elem` symbols
  then (sc - 1, used `union` [p], p : ps)
  else (sc, used, p : ps)

reduce :: [Symbol] -> Solution -> Solution -> [Move] -> Solution
reduce _ _ cutoff [] = cutoff
reduce symbols parent cutoff ((s,p):moves) =
    let sol = makeFindBest symbols (addToSoln symbols p parent) cutoff s
        best = if (getscore sol) > (getscore cutoff)
               then sol
               else cutoff
    in reduce symbols parent best moves

-- | makeFindBest SYMBOLS PARENT CUTOFF STR searches for the best placements
-- that can be made on STR from SYMBOLS, that are strictly better than CUTOFF,
-- and prepends those placements to PARENTs third element.
makeFindBest :: [Symbol] -> Solution -> Solution -> String -> Solution
makeFindBest _ cutoff _ "" = cutoff
makeFindBest symbols parent cutoff str =
  -- should be memoized by (snd parent) (i.e. the used symbols) and str
  let moves = if (getscore parent) > (getscore cutoff)
              then (mapMaybe (consume str) symbols) ++ [(drop 1 str, Nothing)]
              else (mapMaybe (consume str) symbols)
  in reduce symbols parent cutoff moves

-- a solution that makes no placements
worstScore str symbols = -(length str) * (1 + (length symbols))

findBest str symbols =
  (\(_,_,ps) -> reverse ps)
  (makeFindBest symbols (0, [], []) (worstScore str symbols, [], []) str)
p00ya
Yay! I THOUGHT I smelled dynamic programming.
Charlie Martin
Just so you know, I'm examining this and translating it into my implementation language (C#). If it works, I'll accept your answer.
JSBangs
@JSBangs, you'll want to use my second answer since you're using an imperative language.
p00ya
@p00ya, C# adapts itself very well to a functional style, so I don't think this would be too hard to implement as-is. But I haven't had much time to work on it lately :(.
JSBangs
well it's not just the style, the C++ Aho-Corasick/Dijkstra answer is much quicker too, but both of those algorithms are tricky to write in pure functional style with lazy evaluation and referential transparency, hence why I started over with a new answer.
p00ya
+2  A: 

Looks to me like a tree structure is needed.

Basically your initial branching is on all the substrings, then all but the one you used in the first round etc all the way to the bottom. You're right in that this branches to 2^50 but if you use ab-pruning to quickly terminate branches that are obviously inferior and then add some memoization to prune situations you've seen before you could speed up considerably.

You'll probably have to do a fair amount of AI learning to get it all but wikipedia pages on ab-pruning and transposition tables will get you a start.

edit: Yep you're right, probably not clear enough. Assuming your example "ABABABAB BABABABA" with substrings {"ABAB","BABA"}. If you set your evaluation function to simply treat wasted characters as bad the tree will go something like this:

ABAB (eval=0)
  ABAB (eval=0)
    ABAB (eval=2 because we move past/waste a space char and a B)
      [missing expansion]
    BABA (eval=1 because we only waste the space)
      ABAB (eval=2 now have wasted the space above and a B at this level)
      BABA (eval=1 still only wasted the space)*
  BABA (eval=1 prune here because we already have a result that is 1)
BABA (eval=1 prune here for same reason)

*best solution

I suspect the simple 'wasted chars' isn't enough in the non trivial example but it does prune half the tree here.

Daniel
The 1-substring-at-a-time branching pattern won't work, consider "ABABABAB BABABABA" with substrings ["ABAB", "BABA"]. Discovered 1st-ply solutions are "[ABAB][ABAB] B[ABAB]ABA" ("ABAB" first) and "A[BABA]BAB [BABA][BABA]", both of which preclude the optimal solution of "[ABAB][ABAB] [BABA][BABA]". Furthermore what heuristic function are you using at non-terminal nodes for the pruning? There's no alpha-beta here, since it's not a minmax-type game (unless you're doing some transformation which you haven't explained at all).
p00ya
Ah, looks good now :)
p00ya
A: 

This is an answer rewritten to use the Aho-Corasick string-matching algorithm and Dijkstra's algorithm, in C++. This should be a lot closer to your target language of C#.

The Aho-Corasick step constructs an automaton (based on a suffix tree) from the set of patterns, and then uses that automaton to find all matches in the input string. Dijkstra's algorithm then treats those matches as nodes in a DAG, and moves toward the end of the string looking for the lowest cost path.

This approach is a lot easier to analyze, as it's simply combining two well-understood algorithms.

Constructing the Aho-Corasick automaton is linear time in the length of the patterns, and then the search is linear in the input string + the cumulative length of the matches.

Dijkstra's algorithm runs in O(|E| + |V| log |V|) assuming an efficient STL. The graph is a DAG, where vertices correspond to matches or to runs of characters that are skipped. Edge weights are the penalty for using an extra pattern or for skipping characters. An edge exists between two matches if they are adjacent and non-overlapping. An edge exists from a match m to a skip if that is the shortest possible skip between m and another match m2 that overlaps with some match m3 starting at the same place as the skip (phew!). The structure of Dijkstra's algorithm ensures that the optimal answer is the first one to be found by the time we reach the end of the input string (it achieves the pruning Daniel suggested implicitly).

#include <iostream>
#include <queue>
#include <vector>
#include <list>
#include <string>
#include <algorithm>
#include <set>

using namespace std;

static vector<string> patterns;
static string input;
static int skippenalty;

struct acnode {
      acnode() : failure(NULL), gotofn(256) {}
      struct acnode *failure;
      vector<struct acnode *> gotofn;
      list<int> outputs; // index into patterns global
};

void
add_string_to_trie(acnode *root, const string &s, int sid)
{
   for (string::const_iterator p = s.begin(); p != s.end(); ++p) {
      if (!root->gotofn[*p])
     root->gotofn[*p] = new acnode;
      root = root->gotofn[*p];
   }
   root->outputs.push_back(sid);
}

void
init_tree(acnode *root)
{
   queue<acnode *> q;
   unsigned char c = 0;
   do {
      if (acnode *u = root->gotofn[c]) {
         u->failure = root;
         q.push(u);
      } else
         root->gotofn[c] = root;
   } while (++c);
   while (!q.empty()) {
      acnode *r = q.front();
      q.pop();

      do {
         acnode *u, *v;
         if (!(u = r->gotofn[c]))
            continue;
         q.push(u);
         v = r->failure;
         while (!v->gotofn[c])
            v = v->failure;
         u->failure = v->gotofn[c];
         u->outputs.splice(u->outputs.begin(), v->gotofn[c]->outputs);
      } while (++c);
   }
}

struct match { int begin, end, sid; };

void
ahocorasick(const acnode *state, list<match> &out, const string &str)
{
   int i = 1;
   for (string::const_iterator p = str.begin(); p != str.end(); ++p, ++i) {
      while (!state->gotofn[*p])
         state = state->failure;
      state = state->gotofn[*p];
      for (list<int>::const_iterator q = state->outputs.begin();
           q != state->outputs.end(); ++q) {
         struct match m = { i - patterns[*q].size(), i, *q };
         out.push_back(m);
      }
   }
}

////////////////////////////////////////////////////////////////////////
bool operator<(const match& m1, const match& m2)
{
   return m1.begin < m2.begin
      || (m1.begin == m2.end && m1.end < m2.end);
}

struct dnode {
      int usedchars;
      vector<bool> usedpatterns;
      int last;
};

bool operator<(const dnode& a, const dnode& b) {
   return a.usedchars > b.usedchars
      || (a.usedchars == b.usedchars && a.usedpatterns < b.usedpatterns);
}
bool operator==(const dnode& a, const dnode& b) {
   return a.usedchars == b.usedchars
      && a.usedpatterns == b.usedpatterns;
}

typedef priority_queue<pair<int, dnode>,
               vector<pair<int, dnode> >,
               greater<pair<int, dnode> > > mypq;

void
dijkstra(const vector<match> &matches)
{
   typedef vector<match>::const_iterator mIt;
   vector<bool> used(patterns.size(), false);
   dnode initial = { 0, used, -1 };
   mypq q;

   set<dnode> last;
   dnode d;

   q.push(make_pair(0, initial));
   while (!q.empty()) {
      int cost = q.top().first;
      d = q.top().second;
      q.pop();

      if (last.end() != last.find(d)) // we've been here before
         continue;

      last.insert(d);
      if (d.usedchars >= input.size()) {
         break; // found optimum
      }

      match m = { d.usedchars, 0, 0 };      
      mIt mp = lower_bound(matches.begin(), matches.end(), m);

      if (matches.end() == mp) {
         // no more matches, skip the remaining string
         dnode nextd = d;
         d.usedchars = input.size();
         int skip = nextd.usedchars - d.usedchars;
         nextd.last = -skip;

         q.push(make_pair(cost + skip * skippenalty, nextd));
         continue;
      }

      // keep track of where the shortest match ended; we don't need to
      // skip more than this.
      int skipmax = (mp->begin == d.usedchars) ? mp->end : mp->begin + 1;
      while (mp != matches.end() && mp->begin == d.usedchars) {
         dnode nextd = d;
         nextd.usedchars = mp->end;
         int extra = nextd.usedpatterns[mp->sid] ? 0 : 1; // extra pattern
         int nextcost = cost + extra;
         nextd.usedpatterns[mp->sid] = true;
         nextd.last = mp->sid * 2 + extra; // encode used pattern
         q.push(make_pair(nextcost, nextd));
         ++mp;
      }

      if (mp == matches.end() || skipmax <= mp->begin)
         continue;

      // skip
      dnode nextd = d;
      nextd.usedchars = mp->begin;
      int skip = nextd.usedchars - d.usedchars;
      nextd.last = -skip;
      q.push(make_pair(cost + skip * skippenalty, nextd));
   }

   // unwind
   string answer;
   while (d.usedchars > 0) {
      if (0 > d.last) {
         answer = string(-d.last, '*') + answer;
         d.usedchars += d.last;
      } else {
         answer = "[" + patterns[d.last / 2] + "]" + answer;
         d.usedpatterns[d.last / 2] = !(d.last % 2);
         d.usedchars -= patterns[d.last / 2].length();
      }

      set<dnode>::const_iterator lp = last.find(d);
      if (last.end() == lp) return; // should not happen
      d.last = lp->last;
   }
   cout << answer;
}

int
main()
{
   int n;
   cin >> n; // read n patterns
   patterns.reserve(n);
   acnode root;
   for (int i = 0; i < n; ++i) {
      string s;
      cin >> s;
      patterns.push_back(s);
      add_string_to_trie(&root, s, i);
   }
   init_tree(&root);

   getline(cin, input); // eat the rest of the first line
   getline(cin, input);
   cerr << "got input: " << input << endl;
   list<match> matches;
   ahocorasick(&root, matches, input);

   vector<match> vmatches(matches.begin(), matches.end());
   sort(vmatches.begin(), vmatches.end());
   skippenalty = 1 + patterns.size();

   dijkstra(vmatches);
   return 0;
}

Here is a test file with 52 single-letter patterns (compile and then run with the test file on stdin):

52 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
p00ya