I'm trying to split up a nucleotide sequence into amino acid strings using a regular expression. I have to start a new string at each occurrence of the string "ATG", but I don't want to actually stop the first match at the "ATG". Valid input is any ordering of a string of As, Cs, Gs, and Ts.

For example, given the input string: ATGAACATAGGACATGAGGAGTCA I should get two strings: ATGAACATAGGACATGAGGAGTCA (the whole thing) and ATGAGGAGTCA (from the second occurrence of "ATG" onward). A string that contains "ATG" n times should produce n results.

I thought the expression /(?:[ACGT]*)(ATG)[ACGT]*/g would work, but it doesn't. If this can't be done with a regexp it's easy enough to just write out the code for, but I always prefer an elegant solution if one is available.

+2  A: 

If you really want to use regular expressions, try this:

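var str = "ATGAACATAGGACATGAGGAGTCA",
    re = /ATG.*/g, match, matches = [];
while ((match = re.exec(str)) !== null) {
    matches.push(match[0]);           // store the matched text, not the whole match array
    re.lastIndex = match.index + 3;   // resume just after this "ATG", not at the end of the match
}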

But be careful with exec and changing the index. You can easily make it an infinite loop.
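To illustrate the infinite-loop hazard, here is a minimal sketch that uses a zero-width lookahead instead of consuming the match. The helper name atgSuffixes is made up for this example. Because the lookahead matches an empty string, exec leaves lastIndex where it was, so the loop has to advance it by hand; dropping that one line makes the loop spin forever on the first hit.

```javascript
// Hypothetical helper, not from the original answer.
function atgSuffixes(str) {
    var re = /(?=ATG)/g, match, matches = [];
    while ((match = re.exec(str)) !== null) {
        // Everything from this "ATG" to the end of the input.
        matches.push(str.slice(match.index));
        // The match is empty, so exec does not move lastIndex forward.
        // Step past the current position manually to avoid an infinite loop.
        re.lastIndex = match.index + 1;
    }
    return matches;
}
```

Calling atgSuffixes("ATGAACATAGGACATGAGGAGTCA") should give the whole input followed by "ATGAGGAGTCA".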

Otherwise you could use indexOf to find the indices and substr to get the substrings:

var str = "ATGAACATAGGACATGAGGAGTCA",
    offset = 0, matches = [];
// Always search the original string, so offset stays a valid index into it
while ((offset = str.indexOf("ATG", offset)) > -1) {
    matches.push(str.substr(offset));
    offset += 3;
}
Gumbo
+1 - Darnitall! I was in the middle of my post, explaining the behaviour of match and exec as defined in the ECMAScript 3rd edition (which defines that the next iteration carries on from *endIndex+1*, not *lastIndex+1*). I'd got that part down and decided to test my solution, similar to your first, in the developer tools console. Didn't realize I had an error in my code, got stuck in an infinite loop and had to close the window. Came back and you'd added the solution to your post already lol. The line *"You can easily make it an infinite loop."* is just teasing me now!
Andy E
Yeah, I was going to use a solution similar to the first one there, I was just curious if there was a more elegant solution. Thanks!
TEmerson
+1  A: 

I think what you want is

var subStrings = inputString.split('ATG');

KISS :)

jasongetsdown
But that leaves out the 'ATG' sections which the user wants to include in his result.
MvanGeest
Every result but the first one will naturally have been preceded by 'ATG'. Concat them back on if you need them.
jasongetsdown
From the post: 'I have to start a new string at each occurrence of the string "ATG", but I don't want to actually stop the first match at the "ATG".' Sure, I could do a series of concats after doing a split, but that's not what I was looking for. Thanks for the input though.
TEmerson
+1  A: 

Splitting a string before each occurrence of ATG is simple; just use

result = subject.split(/(?=ATG)/i);

(?=ATG) is a positive lookahead assertion, meaning "Assert that you can match ATG starting at the current position in the string".

This will split GGGATGTTTATGGGGATGCCC into GGG, ATGTTT, ATGGGG and ATGCCC.

So now you have an array of (in this case four) strings. I would now take those, discard the first one unless the input itself starts with ATG (otherwise that piece contains no ATG at all), and then join the strings no. 2 + ... + n, then 3 + ... + n etc. until you have exhausted the list.
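The joining step described above could be sketched roughly like this. The function name suffixesFromSplit is invented for illustration. It keeps a piece as a starting point only if it begins with ATG, so a leading piece without ATG is dropped, and for each remaining piece it joins that piece with everything after it.

```javascript
// Hypothetical helper: builds "ATG to end of input" strings from the split pieces.
function suffixesFromSplit(subject) {
    var parts = subject.split(/(?=ATG)/i);
    var results = [];
    for (var i = 0; i < parts.length; i++) {
        // Only the first piece can lack a leading ATG; skip it in that case.
        if (!/^ATG/i.test(parts[i])) continue;
        // Join this piece with all later pieces to rebuild the suffix.
        results.push(parts.slice(i).join(''));
    }
    return results;
}
```

For GGGATGTTTATGGGGATGCCC this should yield ATGTTTATGGGGATGCCC, ATGGGGATGCCC and ATGCCC.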

Of course, this regex doesn't validate that the string contains only ACGT characters, since it only matches positions between characters. That check should be done beforehand, i.e. by testing that the input string matches /^[ACGT]*$/i.
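A minimal sketch of that up-front check, assuming it is wrapped in a helper (the name isValidSequence is hypothetical):

```javascript
// Hypothetical helper: true if the input consists only of A, C, G and T
// (case-insensitive). An empty string also passes, because of the *.
function isValidSequence(subject) {
    return /^[ACGT]*$/i.test(subject);
}
```

The split would then only be run when isValidSequence(subject) returns true.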

Tim Pietzcker
A: 

Since you want to capture from every "ATG" to the end, split isn't right for you. You can, however, use replace, and abuse the callback function:

var matches = [];
seq.replace(/atg/gi, function(m, pos){ matches.push(seq.substr(pos)); });
Kobi
I started with `seq.replace(/atg(?=(.*))/gi, function(g0,g1){ matches.push(g0 + g1); });` - this one abuses the callback function *and* capturing groups within lookaheads. Too much.
Kobi
A: 

This isn't with regex, and I don't know if this is what you consider "elegant," but...

var sequence = 'ATGAACATAGGACATGAGGAGTCA';
var matches = [];
var index;
// Check before pushing, and accept index 0 so back-to-back ATGs are not missed
while ((index = sequence.indexOf('ATG')) !== -1) {
    sequence = sequence.slice(index + 3);
    matches.push('ATG' + sequence);
}

I'm not completely sure if this is what you're looking for. For example, with an input string of ATGabcdefghijATGklmnoATGpqrs, this returns ATGabcdefghijATGklmnoATGpqrs, ATGklmnoATGpqrs, and ATGpqrs.

Casey Hope