Task:
to cluster a large pool of short DNA fragments into classes that share common subsequence patterns, and to find the consensus sequence of each class.
- Pool: ca. 300 sequence fragments
- 8 - 20 letters per fragment
- 4 possible letters: a,g,t,c
- each fragment is structured in three regions:
  - region 1: 5 generic letters
  - region 2: 8 or more positions of g's and c's
  - region 3: 5 generic letters
(As a regex, that would be [gcta]{5}[gc]{8,}[gcta]{5})
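To make the three-region layout concrete, here is a minimal Python sketch that applies the regex above and splits a fragment into its regions. The function name `split_regions` and the example fragments are made up for illustration and are not real data. Note that with a full match, region 1 is always the first 5 letters and region 3 the last 5, so any g or c next to the region boundaries is assigned by position.

```python
import re

# Pattern from the post: 5 generic letters, 8 or more g/c, 5 generic letters.
# The capture groups are an addition so the three regions can be pulled apart.
FRAGMENT_RE = re.compile(r"([gcta]{5})([gc]{8,})([gcta]{5})")

def split_regions(fragment):
    """Return (region1, region2, region3), or None if the fragment does not fit."""
    match = FRAGMENT_RE.fullmatch(fragment.lower())
    return match.groups() if match else None

# Made-up example fragments (hypothetical, for illustration only).
examples = ["atgcaggccggccttagc", "atgcaggcatgccttagc"]
for frag in examples:
    print(frag, split_regions(frag))
```

Fragments that return None do not follow the stated structure and could be checked by hand before any alignment.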
Plan:
to perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, together with their consensus sequences.
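As a rough illustration of what I mean by a consensus sequence, here is a minimal Python sketch that takes the most frequent letter in each column. It is not what ClustalW2 does internally. It assumes the region 2 sequences of one class are already aligned to equal length (with `-` as a gap character if needed), and the function name `consensus` and the input sequences are made up.

```python
from collections import Counter

def consensus(aligned):
    """Majority letter per column; all aligned sequences must have equal length."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("expected a non-empty list of equal-length sequences")
    # zip(*aligned) iterates over the alignment column by column.
    # On a tie, the letter seen first in that column is chosen.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

# Made-up, already aligned region 2 sequences (hypothetical data).
region2_class = ["ggccggcc", "ggcgggcc", "gcccggcc"]
print(consensus(region2_class))
```

A simple majority vote like this hides how strongly each column is conserved, so a per-column frequency table would probably be more informative for only two letter types.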
Questions:
- Are my fragments too short, and would it help to increase their size?
- Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
- Which alternative methods or tools can you suggest for this task?
Best regards,
Simon