Does a regular expression exist for (theoretical) tryptic cleavage of protein sequences? The cleavage rule for trypsin is: after R or K, but not before P.
Example:
Cleavage of the sequence VGTKCCTKPESERMPCTEDYLSLILNR
should result in these 3 sequences (peptides):
VGTK
CCTKPESER
MPCTEDYLSLILNR
Note that there is no cleavage after K in the second peptide (because P comes after K).
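To make the rule concrete, here is a minimal sketch of it as a plain loop, without any regular expression and without modifying the input string. The sub name tryptic_peptides is only my own illustration, not an existing library function, and it assumes the sequence is a plain string of one-letter residue codes.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not a library function):
# split a protein sequence at tryptic cleavage sites, i.e. after R or K,
# unless the next residue is P.
sub tryptic_peptides {
    my ($seq) = @_;
    my @peptides;
    my $start = 0;                     # start index of the current peptide
    my $last  = length($seq) - 1;      # index of the final residue
    for my $i (0 .. $last) {
        my $residue = substr($seq, $i, 1);
        next unless $residue eq 'R' || $residue eq 'K';
        # The exception: no cleavage when P immediately follows.
        next if $i < $last && substr($seq, $i + 1, 1) eq 'P';
        push @peptides, substr($seq, $start, $i - $start + 1);
        $start = $i + 1;
    }
    # Keep a trailing peptide that does not end in R or K, if there is one.
    push @peptides, substr($seq, $start) if $start <= $last;
    return @peptides;
}

my @peptides = tryptic_peptides('VGTKCCTKPESERMPCTEDYLSLILNR');
print join("\n", @peptides), "\n";
```

For the example sequence this prints the three peptides listed above, so it can serve as a reference to compare any regular-expression solution against.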
In Perl (it could just as well have been in C#, Python or Ruby):
my $seq = 'VGTKCCTKPESERMPCTEDYLSLILNR';
my @peptides = split /someRegularExpression/, $seq;
I have used this work-around (where a cut marker, =, is first inserted in the sequence and removed again if P is immediately after the cut marker):
my $seq = 'VGTKCCTKPESERMPCTEDYLSLILNR';
$seq =~ s/([RK])/$1=/g; #Main cut rule.
$seq =~ s/=P/P/g; #The exception.
my @peptides = split( /=/, $seq);
But this requires modifying a string that can be very long, and there can be millions of sequences. Is there a way to do the cleavage with a single regular expression passed to split? If so, what would the regular expression be?
Test platform: Windows XP 64 bit. ActivePerl 64 bit. From perl -v: v5.10.0 built for MSWin32-x64-multi-thread.