views:

277

answers:

4

This is a question on combinatorics from a non-mathematician, so please try to bear with me!

Given an array of n distinct characters, I want to generate subsets of k characters in a minimal-change order, i.e. an order in which generation i+1 contains exactly one character that was not in generation i. That's not too hard in itself. However, I also want to maximise the number of cases in which the character that is swapped out in generation i+1 is the same character that was swapped in in generation i. To illustrate, for n=7, k=3:

abc abd abe* abf* abg* afg aeg* adg* acg* acd ace* acf* aef adf* ade bde bdf bef bcf* bce bcd* bcg* bdg beg* bfg* cfg ceg* cdg* cde cdf* cef def deg dfg efg

The asterisked strings indicate the case I want to maximise; e.g. the e that is new in generation 3, abe, replaces a d that was new in generation 2, abd. It doesn't seem possible to have this happen in every generation, but I want it to happen as often as possible.

Typical array sizes that I use are 20-30 and subset sizes around 5-8.

I'm using an odd language, Icon (or actually its derivative Unicon), so I don't expect anyone to post code that I can used directly. But I will be grateful for answers or hints in pseudo-code, and will do my best to translate C etc. Also, I have noticed that problems of this kind are often discussed in terms of arrays of integers, and I can certainly apply solutions posted in such terms to my own problem.

Thanks

Kim Bastin

Edit 15 June 2010:

I do seem to have got into deeper water than I thought, and while I'm grateful for all answers, not all of them have been relevant. As an example of a solution which is NOT adequate, let me post my own Unicon procedure for generating k-ary subsets of a character set s in a minimal change order. Things you need to know to understand the code are: a preposed * means the size of a structure, so if s is a string, *s means the size of s (the number of characters it contains). || is a string concatenation operation. A preposed ! produces each element of a structure, e.g. each character of a string, in turn on successive passes. And the 'suspend' control structure returns a result from a procedure, but leaves the procedure 'in suspense', with all local variables in place, so that new results can be produced if the procedure is called in a loop.

procedure revdoor(s, k)  
# Produces all k-subsets of a string or character set s in a 'revolving  
# door' order. Each column except the first traverses the characters  
# available to it in alphabetical and reverse alphabetical order  
# alternately. The order of the input string is preserved. 
# If called in a loop as revdoor("abcdefg", 3), 
# the order of production is: abc, abd, abe, abf, abg, acg, acf, ace, acd,  
# ade, adf, adg, aeg, aef, afg, bfg, bef, beg, bdg, bdf, bde, bcd, bce,  
# bcf, bcg, cdg, cdf, cde, cef, ceg, cfg, dfg, deg, def, efg  

local i  
static Ctl  

if /Ctl then {             # this means 'if Ctl doesn't exist'
  if k = 0 then return ""  
  Ctl := list(k, 1)        # a list of k elements, each initialised to 1.
}  

if Ctl[k] = 1 then {  
  if k = 1 then suspend !s else  
    every i := 1 to *s-k+1 do {  
      suspend s[i] || revdoor(s[i+1:0], k-1)  
    }   
  } else {  
    if k = 1 then suspend !reverse(s) else  
    every i := -k to -*s by -1 do {  
      suspend s[i] || revdoor(s[i+1:0], k-1)  
    }  
  }  

  # the following line multiplies element k of Ctl by -1 if k < size of Ctl
  # (this controls the order of generation of characters), 
  # and destroys Ctl on final exit from the procedure.
  if k < *Ctl then Ctl[k] *:= -1 else Ctl := &null   

end  

Note that the output of the above procedure is not optimal in my sense. One result of my investigations so far is that the maximum 'swapping score' for a list of k-ary subsets of n elements is not less than comb(n-1, k), or in the case of n=7, k=3, the maximum score is at least comb(6, 3) = 20. I define the 'swapping score' of a list as the number of items in the list whose new element replaces an element in the previous item which was itself new. I haven't got the mathematical equipment to prove this, but it is easy to see with k=1 or k=2. For certain (n,k) a slightly higher score is possible, as in the case of n=7, k=3:

abc abd abe abf abg
acg adg aeg afg
efg dfg cfg bfg
beg bdg bcg
bcd bce bcf
bdf bef
def cef aef
adf acf
acd ace
ade
bde cde
cdf cdg
ceg
deg (swapping score = 21)

It may be noted that the above list is in 'strong minimal change order' (like word golf: the new character is always in the same position as the character it replaces), which may indicate the direction my own work is taking. I hope to post something more in a few days.

Kim

+1  A: 

It's fairly straightforward. In order to maximise replacement just think of the characters as numbers and increment the string by one till you have reached the upper limit. Then check to see that you don't use the same character twice in the string. I think this would work:

char c[] = {'a', 'b', 'c', 'd', 'e'};
const int n = 5;
const int k = 3;
char s[k];

void print()
{
    for( int i = 0; i < k; ++i )
        putchar(c[s[i]]);
    putchar('\n');
}

bool increment( int m )
{
    // reached the limit?
    if( ++s[m] == n && m == 0 )
        return false;

    next:   
    for( int i = 0; i < m; ++i )
    {
        if( s[m] == n )
        {
            // carry
            s[m] = 0;
            if( !increment( m-1 ))
                return false;
            goto next;
        }
        else if( s[i] == s[m] )
        {
            // the character is already used
            ++s[m];
            goto next;
        }   
    }
    return true;
}

int main(int, char**)
{   
    // initialise
    for( int i = 0; i < k; ++i )
        s[i] = i;

    // enumerate all combinations
    do
        print();
    while(increment(k-1));
}
bitc
As far as I can understand this C (tracing it with pen and paper — I couldn't get it to compile under LCC-win32), its first few results are: abc abd abe acb... If I'm right, this is not what I want, and it seems I haven't formulated the problem completely. I want to produce all k-ary <b>combinations</b>, not permutations, of n characters. See my example for n=7, k=3. acb is a permutation of abc, so should not be produced if abc has been produced previously.Kim Bastin
Kim Bastin
A: 

Kim, your problem description sounds very much like a (homework) attempt to describe the simplest solution for enumerating all k-combinations of a set of n elements - without giving the actual solution away too easily. Anyway, see below for my shot. I used Java but the important parts are not different from C.

public class Homework
{
  /**
   * Prints all k-combinations of a set of n elements. Answer to this 
   * question: http://stackoverflow.com/questions/2698551
   */
  public static void main(String[] args)
  {
    Combinations combinations = new Combinations(7, 3);
    System.out.printf(
        "Printing all %d %d-combinations of a set with %d elements:\n", 
        combinations.size(), combinations.k, combinations.n);
    for (int[] c : combinations)
      System.out.println(Arrays.toString(c));
  }

  /**
   * Provides an iterator for all k-combinations of a set of n elements. 
   */
  static class Combinations implements Iterable<int[]>  
  {
    public final int n, k;

    public Combinations(int n, int k)
    {
      if (n < 1 || n < k)
        throw new IllegalArgumentException();
      this.n = n;
      this.k = k;
    }

    @Override
    public Iterator<int[]> iterator()
    {
      return new Iterator<int[]>()
      {
        private int[] c;

        @Override
        public void remove() { throw new UnsupportedOperationException(); }

        @Override
        public int[] next()
        {
          if (c == null)
          {
            c = new int[k];
            for (int i = 0; i < k; i++)
              c[i] = i;
          }
          else
          {
            int i = c.length - 1;
            while (i >= 0 && c[i] == n - k + i)
              i--;

            if (i < 0)
              throw new NoSuchElementException();

            c[i]++;
            for (int j = i + 1; j < c.length; j++)
              c[j] = c[i] + j - i;
          }
          return c.clone(); // remove defensive copy if performance is more important
        }

        @Override
        public boolean hasNext() { return c == null || c[0] < n - k; }
      };
    }

    /**
     * Returns number of combinations: n! / (k! * (n - k)!).
     */
    public BigInteger size()
    {
      BigInteger s = BigInteger.valueOf(n);
      for (int i = n - 1; i > n - k; i--)
        s = s.multiply(BigInteger.valueOf(i));
      for (int i = k; i > 1; i--)
        s = s.divide(BigInteger.valueOf(i));
      return s;
    }
  }
}

Here is the output for your example:

Printing all 35 3-combinations of a set with 7 elements:
[0, 1, 2] [0, 1, 3] [0, 1, 4] [0, 1, 5] [0, 1, 6] [0, 2, 3] [0, 2, 4] [0, 2, 5] [0, 2, 6] [0, 3, 4] 
[0, 3, 5] [0, 3, 6] [0, 4, 5] [0, 4, 6] [0, 5, 6] [1, 2, 3] [1, 2, 4] [1, 2, 5] [1, 2, 6] [1, 3, 4] 
[1, 3, 5] [1, 3, 6] [1, 4, 5] [1, 4, 6] [1, 5, 6] [2, 3, 4] [2, 3, 5] [2, 3, 6] [2, 4, 5] [2, 4, 6] 
[2, 5, 6] [3, 4, 5] [3, 4, 6] [3, 5, 6] [4, 5, 6] 
deepc
Thank you for your answer. Unfortunately you do not appear to have read my question with sufficient care. I want to generate k-out-of-n combinations in a minimal-change order, i.e. each new generation should change exactly one character. There are well-known algorithms with this property, but your solution does not exhibit it: note the transition [0,1,6] [0,2,3], which changes two elements. The specific requirement in my question — to maximise the immediate replacement of new elements — is not addressed at all.
Kim Bastin
You are right, I should have paid more attention to your minimal-change maximal-swapping requirement. I did not know about minimal-change algorithms (a brief web search proved their existance but I did not see any code without ACM membership etc). Mostly replacing the last elements seems to guarantee a high swap rate for those - but other orderings may maximize that.If the number of combinations is manageable I am sure that at least some linear programming approach can optimize the combination order as desired. A direct solution is preferable, of course.May I ask why you need this?
deepc
A: 

Rather than start with an algorithm, I've tried to think of a way to find a form for the maximum "swapping score", so that you know what to shoot for. Often an algorithm for producing the desired structure can emerge from such a proof.

It's been a long time since university, but I've tried to think of a combinatorial model that will help to figure this out, without very much luck.

I started by imagining the set of combinations as vertices in a graph, with a edges corresponding to the "adjacency" (only one element difference) of the combinations. So:

  • "n choose k" vertices
  • each vertex has degree k(n-k)
  • number of edges = "n choose k" * k(n-k) / 2 = "n choose 2" * "n-2 choose k-1"

There's a lot of symmetry to these graphs. The graph is the same for any given {n,k} as it is for {n,n-k}. If k=1 or k=n-1 it's the complete graph on n vertices (each combinations differs from all the others by only one character). I can't see an obvious algorithm from this, though.

Edit: My next thought was to conceive the graph with a slightly different interpretation. You can think of each {n,k}-combination as a sequence of n bits where there are k 1s. The position of the 1s corresponds to which of the n characters is present in the combination. So for n=7 k=3, abc is 1110000, adg is 1001001, efg is 0000111. With this you can also imagine the points lying at the corners of an n-dimensional hypercube. So for a given subsequence, the edges match your "minimal swapping" criteria if they are co-planar: I think of them as "cutting planes" through the hypercube.

You are looking for a Hamiltonian path through this graph of combinations, one that meets your special criteria.

Another way to think of your problem is to minimize the number of times in the sequence that you do change which character in the combination is being altered.

Zac Thompson
I think an optimised search strategy will be more easily found than an algorithm which produces a maximised-swapping order directly. It's easy to start such an order — see the listing for n=7, k=3 below the code I posted — and easy to see patterns. Continuing to the end is harder, and any pattern is hard to discern. I hope you keep thinking along your lines, which are roughly parallel to mine.
Kim Bastin
Thanks for nudging me in the direction of graph theory. I find my problem is equivalent to the minimal clique cover problem, i.e. finding the smallest set of cliques that includes every vertex in the graph... which is an NP-complete problem :(
Kim Bastin
A: 

For a good answer, would computing the list of combinations all at once be acceptable, or do you need to compute them one at a time? In other words, do you need a function:

Combination nextCombo();

or would

vector<Combination> allCombinations();

be acceptable?

If computing the combinations in batch is permissible, it is possible that an iterative-deepening A* search (or just an A* search but I suspect it'd run out of memory) would work. With an admissible heuristic, A* is guaranteed to give the optimum. I'm short of time, so I decided to post this partial answer and edit the post if I get time to write code.

A* is a graph search algorithm. In this case, the nodes are lists of combinations used so far (no duplicates allowed in the list). My plan was to use a bit-string representation for the nodes. n=30 would fit into a 32 bit integer. We can arbitrarily permute any solution so that the first combination begins with 0's and ends in 1's, i.e. 000...1111. A node with a shorter list is connected to a longer one if the two lists are the same up until the last element and the last element differs only in having one 0'bit flipped to a 1 and one 1 bit flipped to a 0. The distance between the two is 0 if there was a "swap" and 1 if there was a swap.

A second representation for each combination is a sorted list of the bits that are turned on. One possible admissible heuristic for this graph is to use this index list. Each time (in the list of combinations) an index is used at a particular position in the index list, mark it off. The number of positions with un-used indices - the current last changed index is (I believe) the minimal number of swaps that need to happen.

To illustrate this heuristic, consider the sequence abc abd abe* abf* abg afg from above. (the letters would be numbers in my treatment, but that is a minor difference). This sequence (which would be one node in the search-graph) would have the following places marked:

 1   2   3
*a      
 b  *b   
 c   c  *c
 d   d  *d
 e   e  *e
    *f  *f
        *g

Thus the heuristic would predict that there is at least one swap required (since there are no unmarked elements in position 3 and the current position is 2).

If I get the time, I'll edit this to post code and performance of the algorithm.


Re: the NP completeness result (in a comment to Zac Thompson's answer). The graph on which we are searching for a minimal cost Hamiltonian path has a very special structure. For example, the normally NP-complete Hamiltonian Path problem can be solved in O(n) time with the "enumerate all combinations" algorithm - with n being the number of nodes in the graph. This structure makes it possible that, though on a general graph, vertex cover is hard, on your graph it may be polynomial (even linear or quadratic). Of course, since the graph has a lot of nodes for e.g. n=30, k=8 you may still have a lot of computation ahead of you.

Eponymous
Thank you. It looks as though we are getting somewhere. In reply to your opening question, an answer in either form — subset by subset, or all at once in a list — would be equally acceptable. Notwithstanding the apparently discouraging conclusion that the problem in general terms is NP-complete, I have made some progress myself in the last few days. However, I now have to go away on a trip shortly and won't be able to post more on this question until 5 July. Hopefully one or both of us (or someone else) may have cracked the problem.
Kim Bastin