tags:

views:

676

answers:

5

I apologize for not have the math background to put this question in a more formal way. I'm looking to create a string of 796 letters (or integers) with certain properties.

Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded. i.e. ABBB BABA BBBA are each equivalent to {AB}. In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string (i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc) And then producing the set of letters that comprise each group (repetitions and order disregarded)

For example, in the string of 9 letters:

A B B A C E B C D

the first 4-letter groups is: ABBA, which is comprised of the set {AB} the second group is: BBAC, which is comprised of the set {ABC} the third group is: BACE, which is comprised of the set {ABCE} etc.

The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.

For example, if there is a set of 5 letters {A, B, C, D, E} being used Then the possible 1-4 letter combinations are: A, B, C, D, E, AB, AC, AD, AE, BC, BD, BE, CD, CE, DE, ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE, ABCD, ABCE, ABDE, ACDE, BCDE

Here is a working example that uses a set of 5 letters {A, B, C, D, E}.

D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B

The 1st through 4th elements form the set: D The 2nd through 5th elements form the set: DE The 3rd through 6th elements form the set: CDE The 4th through 7th elements form the set: BCDE The 5th through 8th elements form the set: BCE The 6th through 9th elements form the set: BC The 7th through 10th elements form the set: B etc.

* I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string) starting (and if possible ending) with 4 of the same letter. *

Here is a working solution for 7 letters:

AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE

A: 

Cool problem. Just a draft/psuedo algo:

dim STR-A as string = getall(ABCDEFGHIJKL) 
//custom function to generate concat list of all 793 4-char combos.
//should be listed side-by-side to form 3172 character-long string.
//different ordering may ultimately produce different results.
//brute-forcing all orders of combos is too much work (793! is a big #).
//need to determine how to find optimal ordering, for this particular
//approach below.
dim STR-B as string = "" // to hold the string you're searching for  
dim STR-C as string = "" // to hold the sub-string you are searching in  
dim STR-A-NEW as string = "" //variable to hold your new string
dim MATCH as boolean = false //variable to hold matching status

while len(STR-A) > 0 
//check each character in STR-A, which will be shorted by 1 char on each
//pass.
  MATCH = false
  STR-B = left(STR-A, 4)
  STR-B = reduce(STR-B) 
  //reduce(str) is a custom re-usable function to sort & remove duplicates

  for i as integer = 1 to len((STR-A) - 1)
    STR-C = substr(STR-A, i, 4) 
    //gives you the 4-character sequence beginning at position i
    STR-C = reduce(STR-C)
    IF STR-B = STR-C Then
       MATCH = true
       exit for 
       //as long as there is even one match, you can throw-away the first 
       //letter
    END IF
    i = i+1
  next


  IF match = false then 
  //if you didn't find a match, then the first letter should be saved
     STR-A-NEW += LEFT(STR-B, 1)
  END IF

  MATCH = false //re-init MATCH
  STR-A = RIGHT(STR-A, LEN(STR-A) - 1) //re-init STR_A
wend

Anyway -- there could be problems at this, and you'd need to write another function to parse your result string (STR-A-NEW) to prove that it's a viable answer...

dave
btw -- you mention the first letter should match the last. There's no telling how long STR-A-NEW will be, but if it ends up shorter than your 796 max, then just copy the first character and tag it onto the end of the string. You'll end-up with 1 duplicate combination that way, but as long as that's not illegal...:-)
dave
Great. I should clarify the "begin and end with four of the same letter" bit. The first and last letters don't need to be the same. i.e. AAAA...LLLL
Erik
//custom function to generate concat list of all 793 4-char combos//should be listed side-by-side to form 1472 character-long stringdoes this mean a,b,c,d,...,ab,ac,ad,... or does this mean aaaa,bbbb,cccc,...,aabb,aacc, it seems by the fact that you say this will be 1472 char (instead of 3172) you mean the former, but will that actually ever produce the strings aaaa, bbbb,... because the final solution has to have them. It's an interesting approach. If it really works that seems like a pretty nice and simple solution.
Jack
Thanks Jack. Yes -- I meant 793x4=3172. Don't know where I got the other number. But the point was to have every possible combination stacked side-by-side in one long string. So as you say, AAAA,AAAB,AAAC...all the way to LLLL. (no commas).
dave
I was going to make a comment about the fact that there are 2401 permutations that you need to account for (since the question stated that aabb,abab,etc are the same), but the way you joined them together and moving down the string one char at a time actually hits all the permutations (i think). Very nice. One comment though Shouldn't you check STR-A-NEW for matches as well?
Jack
Check STR-A-NEW for matches? Good point -- i'd thought of that too. The requirement, if I understand it, is to end-up with a 796-letter string. So if you can reduce it to 796 or less on the first pass, I guess you could just stop there. But if it was longer than 796, or you just wanted to keep going anyway to see how short you could possibly get it, then I'd run it again (and again...), at least for as long as it kept getting shorter.
dave
Also there is a requirement to have every reduced possibility once and only once. the reason the limit is 796 characters is because you have 793 reduced combination and they must overlap each other by three characters to ensure that each one is used only once. the limit is not an arbitrary number. To get a valid solution you should hit 796 exactly, if you have more or less it cannot be a valid answer. Unless I miss understood the problem.
Jack
Exactly. And I'm hoping that some pattern will become clear to help make the problem more solvable... perhaps the sequence of numbers of unique letters in each 4-letter set (1, 2, 3, 4, 4, 4, 3, 4...) or how many times each letter should be present in the final 796-length string, etc.
Erik
+2  A: 

Beware that in order to attempt exhaustive search (answer in VB is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC, expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC which is 3*4! instead of just AABC. If you just concatenate AABC and AABD it would cover just 4 out of 4! perms of AABC and even that by accident. Just this expansion will bring you exponential complexity - end of game. Plus you'll need to maintain association between all explansions and the set (the set becomes a label).

Your best bet is to use one of known efficient De Bruijn constuctors and try to see if you can put your set-equivalence in there. Check out

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf

and

http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf

for a start.

If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.

BTW, try VB answer just for A,B,AB (at least expansion is small) - it will make AABBAB and construct ABBA or ABBAB (or throw in a decent language) both of which are wrong. You can even prove that it will always miss with 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining first 2 passes (it will always miss 2nd A for NxA because (N-1)xA+B is in the string (1st expansion of {AB}).

Oh and if we could establish how many of each letters an optimal soluton should have (don't look at B(5,2) it's too easy and regular :-) a random serch would be feasible - you generate candidates with provable traits (like AAAA, BBBB ... are present and not touching and is has n1 A-s, n2 B-s ...) and random arrangement and then test whether they are solutions (checking is much faster than exhaustive search in this case).

ZXX
A: 

I've been thinking about this one and I'm sketching out a solution.

Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.

Each word abcd has "follow-on" words bcde where a,...,e are all symbols.

Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.

For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.

Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path contains some word w such that v in words(S(w)).

You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.

Code to follow after the weekend, with a bit of luck...

Rafe
A: 

Here is my proposal. I'll admit upfront this is a performance and memory hog.

This may be overkill, but have a class We'll call it UniqueCombination This will contain a unique 1-4 char reduced combination of the input set (i.e. A,AB,ABC,...) This will also contain a list of possible combination (AB {AABB,ABAB,BBAA,...}) this will need a method that determines if any possible combination overlaps any possible combination of another UniqueCombination by three characters. Also need a override that takes a string as well.

Then we start with the string "AAAA" then we find all of the UniqueCombinations that overlap this string. Then we find how many uniqueCombinations those possible matches overlap with. (we could be smart at this point an store this number.) Then we pick the one with the least number of overlaps greater than 0. Use up the ones with the least possible matches first.

Then we find a specific combination for the chosen UniqueCombination and add it to the final string. Remove this UniqueCombination from the list, then as we find overlaps for current string. rinse and repeat. (we could be smart and on subsequent runs while searching for overlaps we could remove any of the unreduced combination that are contained in the final string.)

Well that's my plan I will work on the code this weekend. Granted this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that but I will look into that as well.)

Jack
Beware that if you construct all possible combinations your complexity becomes at least exponential (worse if you end up backtracking).Can the guy who posted this point to some atricle(s) with a formal definition and a motivation for this problem? Maybe also with some upper and lower limit estimates, prior art etc?
ZXX
I do understand that, as i mentioned in my comment on the question I was looking at an admittedly brute force solution. I think the problem here is we don't understand all of the inner workings of this problem we need a complete solution before we can get an optimal solution. As Eric Said "I'm hoping that some pattern will become clear to help make the problem more solvable". it seems that this is for analysis of the problem. so as much as i agree that this is an exponential solution at this point I fear it is the way to go. At least until we learn more.
Jack
Just mentioning what's going to happen :-) That's probably what the guy who stopped at N=7 did - there's just no CPU to handle exponential growth. OTOH, since these are sets you could determine overlaps by set diff, however they have multiset expansions ({AB} -> {3AB}, {2A2B}, {A3B}) so you can't use bitsets.
ZXX
And yes I agree that we don't know enough - that's why I asked for info on the the source, motivation, formal definition. If someone already proved exponential lower bound or NP equivalence there's no use trying anything. This better not be a Loto "system" attempt :-)
ZXX
even though it is an exponential problem even if you go to all 27 letters of the alphabet that is only 531441 permutations taken 4 at a time. I guess the next question is how far do you need to go with this?
Jack
That's the number of expansions. Exhaustive search would examine all permutaions of them. Even if it could do that just by examining them with constant evaluation time per pair it would still be about (C(N,k+1))! or about 792! just for B(12,4). Things that can help are very early and agressive pruning, or having big symetries (like reversing swaping symbols of concat-ing to the other side) or a recursive construction from smaller to bigger problem - my current best bet. Fell free to pick up the B(5,2) and expand either to B(5,3) or B(6,2) if you can. We are all doing this for fun anyway :-)
ZXX
I said upfront it was going to be a performance and memory hog. It just seems the other solutions here are leaving some parts out of the definition for performance sake. Sometimes you just have to find a solution that works first, then optimize later.
Jack
A: 

If there is a non-exponential solution at all it may need to be formulated in terms of a recursive "growth" from a problem with a smaller size i.e to contruct B(N,k) from B(N-1,k-1) or from B(N-1,k) or from B(N,k-1).

Systematic construction for B(5,2) - one step at the time :-) It's bound to get more complex latter [card stands for cardinality, {AB} has card=2, I'll also call them 2-s, 3-s etc.] Note, 2-s and 3-s will be k-1 and k latter (I hope).

  1. Initial. Start with k-1 result and inject symbols for singletons (unique expansion empty intersection):
    • ABCDE -> AABBCCDDEE
  2. mark used card=2 sets: AB,BC,CD,DE
  3. Rewriting. Form card=3 sets to inject symbols into marked card=2. 1st feasible lexicographic expansion fires (may have to backtrack for k>2)
    • it's OK to use already marked 2-s since they'll all get replaced
    • but may have to do a verification pass for higher k
    • AB->ACB, BC->BCD, CD->CED, DE->DAE ==> AACBBDCCEDDAEEB
  4. mark/verify used 2s
    • normally keep marking/unmarking during the construction but also keep keep old mark list
    • marking/unmarking can get expensive if there's backtracking in #3
    • Unused: AB, BE
  5. For higher k may need several recursive rewriting passes
    • possibly partitioning new sets into classes
  6. Finalize: unused 2-s should overlap around the edge (that's why it's cyclic) ABE - B can go to the begining or and: AACBBDCCEDDAEEB

Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-signletons, like doubling or trippling A

B(5,2) -> B(5,3) - B(5,4)

  1. Initial. same: - ABCDE -> AAACBBBDCCCEDDDAEEEB
  2. no use of marking 3-sets since they are all going to be chenged
  3. Rewriting.
    1. choose systematic insertion positions
      • AAA_CBBB_DCCC_EDDD_AEEE_B
    2. mark all 2-s released by this: AC,AD,BD,BE,CE
    3. use marked 2-s to decide inserted symbols - totice total regularity:
      • AxCB D -> ADCB
      • BxDC E -> BEDC
      • CxED A -> CAED
      • DxAE B => DBAE
      • ExBA C -> ECBA
  4. Verify that 3-s are all used (marked inserted symbols just for fun)
    • AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B

Note: Systematic choice if insertion point deterministically dictated insertions (only AD can fit 1st, AC would create duplicate 2-set (AAC, ACC))

Note: It's not going to be so nice for B(6,2) and B(6,3) since number of 2-s will exceede 2x the no of 1-s. This is important since 2-s sit naturally on the sides of 1-s like CBBBE and the issue is how to place them when you run out of 1-s.

  • B(5,3) is so symetrical that just repeating #1 produces B(5.4):
    • AAAADCBBBBEDCCCCAEDDDDBAEEEECB
ZXX
This should be a comment.
BlueRaja - Danny Pflughoeft