views:

103

answers:

4

I was assigned a problem to find genes when given a string of the letters A,C,G, or T all in a row, like ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA. A gene is started with ATG, and ends with either TAA, TAG, or TGA (the gene excludes both endpoints). The gene consists of triplets of letters, so its length is a multiple of three, and none of those triplets can be the start/end triplets listed above. So, for the string above the genes in it are CTCTCT and CACACACACACA. And in fact my regex works for that particular string. Here's what I have so far (and I'm pretty happy with myself that I got this far):

(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)

However, if there is an ATG and end-triplet within another result, and not aligned with the triplets of that result, it fails. For example:

Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG :
TTGCTTATTGTTTTGAATGGGGTAGGA
ACCTGC

It should find also a GGG but doesn't: TTGCTTATTGTTTTGA(ATG|GGG|TAG)GA

I'm new to regex in general and a little stuck...just a little hint would be awesome!

+2  A: 

The problem is that the regular expression consumes the characters that it matches and then they are not used again.

You can solve this by either using a zero-width match (in which case you only get the index of the match, not the characters that matched).

Alternatively you can use three similar regular expressions, but each using a different offset:

(?=(.{3})+$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+.$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+..$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)

You might also want to consider using a different approach that doesn't involve regular expressions as the above regular expression would be slow.

Mark Byers
+2  A: 

The problem with things like this is that you can slowly build up a regex, rule by rule, until you have something taht works.

Then your requirements change and you have to start all over again, because its nearly impossible for mere mortals to easily reverse engineer a complex regex.

Personally, I'd rather do it the 'old fashioned' way - use string manipulation. Each stage can be easily commented, and if there's a slight change in the requirements you can just tweak a particular stage.

Visage
A: 

Perhaps you should try with other methods like working with indexes. Something like :

public static final String genome="ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA";
public static final String start_codon = "ATG";
public final static String[] end_codons = {"TAA","TAG","TGA"};

public static void main(String[] args) {
    List<Integer>start_indexes = new ArrayList<Integer>();
    int curIndex = genome.indexOf(start_codon);
        while(curIndex!=-1){
            start_indexes.add(curIndex);
            curIndex = genome.indexOf(start_codon,curIndex+1);
        }
}

do the same for other codons, and see if indexes match the triplet rule. By the way, are you sure that a gene exclude a start codon? (some ATG can be found in a gene)

oyo
Actually it's a textbook problem, and it's telling me to exclude ATG. Also, the problem doesn't require regex and your solution is the one I SHOULD be doing, but I thought regex would be a fun challenge.
Swordbeard
+1  A: 

Here's a possible regex:

(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))

A little test-rig:

public class Main {
    public static void main(String[]args) {
        String source = "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG";
        Matcher m = Pattern.compile("(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))").matcher(source);
        System.out.println("source : "+source+"\nmatches:");
        while(m.find()) {
            System.out.print("         ");
            for(int i = 0; i < m.start(); i++) {
                System.out.print(" ");
            }
            System.out.println(m.group(1));
        }
    }
}

which produces:

source : TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG
matches:
             ATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGA
                                ATGGGGTAG
                                          ATGACCTGCTAA
                                                                     ATGTAG
Bart Kiers