views:

13

answers:

0

Hi, I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.

For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.

I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They either are for comparing two string or for getting some kind of storage-optimal result.

My objective, on the other hand, is to reduce the genome to its "words" instead of "letters". ie, instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in dna.

If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.