views:

215

answers:

3

Hi,

I am trying to implement protein pairwise sequence alignment using "Global Alignment" algorithm by 'Needleman -Wunsch'.

I am not clear about how to include 'Blosum62 Matrix' in my source code to do the scoring or to fill the two-dimensional matrix?

I have googled and found that most people suggested to use flat file which contains the standard 'Blosum62 Matrix'. Does it mean that I need to read from this flat file and fill my coded "Blosum62 Martrix' ?

Also, the other approach could be is to use some mathematical formula and include it in your programming logic to construct 'Blosum62 Matrix'. But not very sure about this option.

Any ideas or insights are appreciated.

Thanks.

+2  A: 

It would help to know what language you're working in so we can help you with the correct terms, but what I did was use a map of maps (or dictionaries if you're using Python).

Here's an example of my code in Groovy, but it's fairly portable to other languages:

def blosum62 = [
Cys:[Cys:9, Ser:-1, Thr:-1, Pro:-3, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-4, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:-2, Tyr:-2, Trp:-2],
Ser:[Cys:-1,Ser:4,  Thr:1,  Pro:-1, Ala:1,  Gly:0,  Asn:1,  Asp:0,  Glu:0,  Gln:0,  His:-1, Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Thr:[Cys:-1,Ser:1,  Thr:4,  Pro:1,  Ala:-1, Gly:1,  Asn:0,  Asp:1,  Glu:0,  Gln:0,  His:0,  Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Pro:[Cys:-3,Ser:-1, Thr:1,  Pro:7,  Ala:-1, Gly:-2, Asn:-1, Asp:-1, Glu:-1, Gln:-1, His:-2, Arg:-2, Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-4, Tyr:-3, Trp:-4],
Ala:[Cys:0, Ser:1,  Thr:-1, Pro:-1, Ala:4,  Gly:0,  Asn:-1, Asp:-2, Glu:-1, Gln:-1, His:-2, Arg:-1, Lys:-1, Met:-1, Ile:-1, Leu:-1, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Gly:[Cys:-3,Ser:0,  Thr:1,  Pro:-2, Ala:0,  Gly:6,  Asn:-2, Asp:-1, Glu:-2, Gln:-2, His:-2, Arg:-2, Lys:-2, Met:-3, Ile:-4, Leu:-4, Val:0,  Phe:-3, Tyr:-3, Trp:-2],
Asn:[Cys:-3,Ser:1,  Thr:0,  Pro:-2, Ala:-2, Gly:0,  Asn:6,  Asp:1,  Glu:0,  Gln:0,  His:-1, Arg:0,  Lys:0,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-4],
Asp:[Cys:-3,Ser:0,  Thr:1,  Pro:-1, Ala:-2, Gly:-1, Asn:1,  Asp:6,  Glu:2,  Gln:0,  His:-1, Arg:-2, Lys:-1, Met:-3, Ile:-3, Leu:-4, Val:-3, Phe:-3, Tyr:-3, Trp:-4],
Glu:[Cys:-4,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:2,  Glu:5,  Gln:2,  His:0,  Arg:0,  Lys:1,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Gln:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:0,  Glu:2,  Gln:5,  His:0,  Arg:1,  Lys:1,  Met:0,  Ile:-3, Leu:-2, Val:-2, Phe:-3, Tyr:-1, Trp:-2],
His:[Cys:-3,Ser:-1, Thr:0,  Pro:-2, Ala:-2, Gly:-2, Asn:1,  Asp:1,  Glu:0,  Gln:0,  His:8,  Arg:0,  Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-1, Tyr:2,  Trp:-2],
Arg:[Cys:-3,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-2, Asn:0,  Asp:-2, Glu:0,  Gln:1,  His:0,  Arg:5,  Lys:2,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Lys:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:-1, Glu:1,  Gln:1,  His:-1, Arg:2,  Lys:5,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Met:[Cys:-1,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:0,  His:-2, Arg:-1, Lys:-1, Met:5,  Ile:1,  Leu:2,  Val:-2, Phe:0,  Tyr:-1, Trp:-1],
Ile:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:1,  Ile:4,  Leu:2,  Val:1,  Phe:0,  Tyr:-1, Trp:-3],
Leu:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-4, Glu:-3, Gln:-2, His:-3, Arg:-2, Lys:-2, Met:2,  Ile:2,  Leu:4,  Val:3,  Phe:0,  Tyr:-1, Trp:-2],
Val:[Cys:-1,Ser:-2, Thr:-2, Pro:-2, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-2, Gln:-2, His:-3, Arg:-3, Lys:-2, Met:1,  Ile:3,  Leu:1,  Val:4,  Phe:-1, Tyr:-1, Trp:-3],
Phe:[Cys:-2,Ser:-2, Thr:-2, Pro:-4, Ala:-2, Gly:-3, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-1, Arg:-3, Lys:-3, Met:0,  Ile:0,  Leu:0,  Val:-1, Phe:6,  Tyr:3,  Trp:1],
Tyr:[Cys:-2,Ser:-2, Thr:-2, Pro:-3, Ala:-2, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:-1, His:2,  Arg:-2, Lys:-2, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:3,  Tyr:7,  Trp:2],
Trp:[Cys:-2,Ser:-3, Thr:-3, Pro:-4, Ala:-3, Gly:-2, Asn:-4, Asp:-4, Glu:-3, Gln:-2, His:-2, Arg:-3, Lys:-3, Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:1,  Tyr:2,  Trp:11]
]

Using this you can just call

def score = blosum62[Arg][Cys]
println("Substituting Arg by Cys gives " + score)
Tim
Thanks Tim.I am using VB.NET. I have no idea about Groovy but I will look at your sample and try to figure out your appraoch. When I did Nucleotid pairwise alignment using "Local Alignment" algorithm, I created a customized data structure.I will get back here soon.
anshu
Not sure if VB.Net allows nested Dictionaries, but if it does is should be something like `Dim dictionary As New Dictionary(Of String, Dictionary(Of String, Integer))`.
Tim
Thanks for youre reply.Let me find out if nested dictionaries are allowed in VB.Net. My guess is that it should be. Also, did you use any pesudo algorithm to do the protein pairwise alignment? I tired to find online the basic steps of the alogrithm but no luck so I am planning to do the same steps as I did for the pairwise alignment of Nucleotides.
anshu
Sorry but I haven't implemented the actual Needleman - Wunsch algorithm, I just needed the matrix for a single comparison. I would not recommend implementing the algorithm for any reason other than to learn from it, as there are already numerous working and tested implementations that will out perform anything you might implement yourself.
Tim
Thanks Tim for your help.
anshu
+1  A: 

You can always download the matrix from NCBI web site:

ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62

Other matrices are also available from the parent directory.

I never saw implementation of Needleman-Wunsch with matrix calculation. It's much easier just to include the matrix in the code or as a separate file.

You can find some details how BLOSUM matrices are calculated for example here: http://en.wikipedia.org/wiki/BLOSUM.

yuk
A: 

You can't infer a blosum matrix from another as you can do for PAM ones: all the blosum are calculated from a different set of data and are not correlated within theirselves.

For example, a PAM250 Matrix is just a PAM1 matrix multiplied 250 times by itself; but this is not true for BLOSUMs, and you can't infer BLOSUM80 from BLOSUM64, for example.

dalloliogm