I was assigned a problem: find the genes in a string made up of the letters A, C, G, and T all in a row, like ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA. A gene starts with ATG and ends with TAA, TAG, or TGA (the gene excludes both endpoints). The gene consists of triplets of letters, so its length is a multiple of three, and none of those triplets can be the start/end triplets listed above. So for the string above, the genes are CTCTCT and CACACACACACA. And in fact my regex works for that particular string. Here's what I have so far (and I'm pretty happy with myself that I got this far):
(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
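For anyone who wants to reproduce the working case, here is a minimal sketch. It assumes Python's `re` module, since the regex flavor isn't stated here; the helper name `find_genes` is made up for illustration. The pattern is copied unchanged.

```python
import re

# Pattern copied verbatim; Python's `re` is an assumed engine.
GENE_PATTERN = re.compile(r"(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)")


def find_genes(dna):
    # group(0) is the whole match: the gene body without ATG or the end triplet,
    # because both are only checked by lookbehind/lookahead and never consumed.
    return [m.group(0) for m in GENE_PATTERN.finditer(dna)]


dna = "ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA"
print(find_genes(dna))  # ['CTCTCT', 'CACACACACACA']
```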
However, it fails when an ATG and an end triplet occur inside another result but are not aligned with the triplets of that result. For example:
Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG:
TTGCTTATTGTTTTGAATGGGGTAGGA
ACCTGC
It should also find GGG, but it doesn't: TTGCTTATTGTTTTGA(ATG|GGG|TAG)GA
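To make the expected output concrete, here is a rough sketch that compares the regex with a plain loop that applies the stated rules to every ATG independently. It again assumes Python's `re`; the function names `genes_by_regex` and `genes_by_scan` are hypothetical, and the loop is only a reference for the rules as described above, not a proposed fix to the pattern.

```python
import re

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
# Pattern copied verbatim; Python's `re` is an assumed engine.
GENE_PATTERN = re.compile(r"(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)")


def genes_by_regex(dna):
    return [m.group(0) for m in GENE_PATTERN.finditer(dna)]


def genes_by_scan(dna):
    """Check every ATG on its own, so a gene inside another result is kept."""
    genes = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != START:
            continue
        body_start = start + 3
        # Walk the triplets that follow this particular ATG.
        for i in range(body_start, len(dna) - 2, 3):
            triplet = dna[i:i + 3]
            if triplet in STOPS:
                if i > body_start:  # at least one triplet, like `+` in the regex
                    genes.append(dna[body_start:i])
                break
            if triplet == START:
                break  # a start triplet in the body disqualifies this candidate
    return genes


dna = "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG"
print(genes_by_regex(dna))  # ['TTGCTTATTGTTTTGAATGGGGTAGGA', 'ACCTGC']
print(genes_by_scan(dna))   # ['TTGCTTATTGTTTTGAATGGGGTAGGA', 'GGG', 'ACCTGC']
```

The difference between the two outputs is the nested GGG. In Python, `finditer` reports non-overlapping matches, so once the first long result has been consumed, the ATG inside it is never tried as a starting point.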
I'm new to regex in general and a little stuck...just a little hint would be awesome!