views:

26

answers:

0

Problem 1: I'd like to generate a set of n strings of fixed length m from alphabet s such that the minimum Levenshtein distance (edit distance) between any two strings is greater than some constant c. Obviously, I can use randomization methods (e.g., a genetic algorithm), but was hoping that this may be a well-studied problem in computer science or mathematics with some informative literature and an efficient algorithm or three.

Problem 2: Same as above except that adjacent characters cannot repeat; the i'th character in each string may not be equal to the i+1'th character. E.g., 'CAT', 'AGA' and 'TAG' are allowed, 'GAA', 'AAT', and 'AAA' are not.

Background: The basis for this problem is bioinformatic and involves designing unique DNA tags that can be attached to biologically derived DNA fragments and then sequenced using a fancy second generation sequencer. The goal is to be able to recognize each tag, allowing for random insertion, deletion, and substitution errors. The specific DNA sequencing technology has a relatively low error rate per base (~1%), but is less precise when a single base is repeated 2 or more times (motivating the additional constraints imposed in problem 2).