To cluster a large pool of short DNA fragments into classes that share common subsequence patterns, and to find the consensus sequence of each class.

  • Pool: ca. 300 sequence fragments
  • 8 - 20 letters per fragment
  • 4 possible letters: a,g,t,c
  • each fragment is structured in three regions:
    1. 5 generic letters
    2. 8 or more positions of g's and c's
    3. 5 generic letters
      (As a regex, that would be [gcta]{5}[gc]{8,}[gcta]{5})
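
A minimal sketch of how the regex above could be used to check that a fragment fits the three-region layout. The function name `matches_layout` is made up for illustration, and lowercase input letters are assumed.

```python
import re

# Pattern copied from the question: 5 generic letters,
# 8 or more g/c letters, then 5 generic letters.
FRAGMENT_RE = re.compile(r"[gcta]{5}[gc]{8,}[gcta]{5}")

def matches_layout(fragment: str) -> bool:
    """Return True if the whole fragment fits the three-region layout."""
    return FRAGMENT_RE.fullmatch(fragment.lower()) is not None

if __name__ == "__main__":
    example = "atata" + "gcgcgcgc" + "atata"  # 5 + 8 + 5 letters
    print(matches_layout(example))  # True
    print(matches_layout("atgc"))   # False: too short for the layout
```

Note that, as written, the pattern can only match fragments of at least 18 letters (5 + 8 + 5).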

To perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, along with their consensus sequences.


  1. Are my fragments too short, and would it help to increase their size?
  2. Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
  3. Which alternative methods or tools can you suggest for this task?

Best regards,


+1  A: 

Your region 2, with only two letters, may end up a bit too similar. Increasing its length or variability (e.g. allowing more letters) could help.

+1  A: 

Yes, 300 is FAR TOO FEW considering that this is the human genome and you're essentially just looking for a particular kind of 8-mer. There are 4^8 = 65,536 possible 8-mers and 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). Of those 8-mers, 2^8 = 256 consist only of G and C, so you'd expect to find G/C-only 8-mers about 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably many more, since the genome is full of CpG islands). Why only choose 300?
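
As a quick check of the estimate above, here is a minimal sketch. It assumes every base is independent and equally likely, which real genomes do not satisfy, and the constant names are illustrative only.

```python
# Back-of-the-envelope estimate under a uniform, independent-base model.
GENOME_BASES = 3_000_000_000   # approximate genome size used in the answer
ALL_8MERS = 4 ** 8             # 65,536 possible 8-mers over a, c, g, t
GC_ONLY_8MERS = 2 ** 8         # 256 8-mers made only of g and c

# Expected occurrences of any single 8-mer, times the number of G/C-only 8-mers.
expected_hits = GENOME_BASES / ALL_8MERS * GC_ONLY_8MERS

print(f"{expected_hits:,.0f}")  # 11,718,750, i.e. roughly 12 million
```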

You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC, and extend until you hit the first base that is not G or C. Then take that sequence and its context, and save them (in a DB). Rinse and repeat.
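
A minimal sketch of that scan, under some assumptions: the chromosome is already loaded as a plain string, the "context" is a fixed number of flanking letters, and results are simply yielded instead of being written to a database. It also treats any maximal run of g/c letters as a region, which is slightly broader than starting strictly at a CG or GC pair. The name `gc_runs` and its parameters are hypothetical.

```python
from typing import Iterator, Tuple

def gc_runs(sequence: str, min_len: int = 8,
            flank: int = 5) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (start, left_context, gc_region, right_context) for each
    maximal run of g/c letters that is at least min_len long."""
    seq = sequence.lower()
    i, n = 0, len(seq)
    while i < n:
        if seq[i] in "gc":
            j = i
            # Extend until the first letter that is not g or c.
            while j < n and seq[j] in "gc":
                j += 1
            if j - i >= min_len:
                yield i, seq[max(0, i - flank):i], seq[i:j], seq[j:j + flank]
            i = j  # continue scanning after this run
        else:
            i += 1

if __name__ == "__main__":
    toy = "atata" + "gcgcgcgc" + "atata" + "ttgg"
    for hit in gc_runs(toy):
        print(hit)  # (5, 'atata', 'gcgcgcgc', 'atata')
```

In a real run, each yielded tuple would be stored (for example as a database row) rather than printed.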

For this project, Clustal may be overkill -- but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering like so:

  1. Make a database entry for each G/C 8-mer (2^8 = 256 in all).
  2. Take each GC-region and walk it to see which 8-mers it contains.
  3. Tag each GC-region with the sequences it contains.

Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.
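
The three tagging steps above can be sketched as follows. This is only an illustration: an in-memory dictionary stands in for the database, regions are identified by their position in the input list, and the name `index_by_kmer` is made up.

```python
from itertools import product
from typing import Dict, List, Set

K = 8  # k-mer length used in the answer

def index_by_kmer(gc_regions: List[str], k: int = K) -> Dict[str, Set[int]]:
    """Map every G/C-only k-mer to the ids of the regions containing it."""
    # Step 1: one entry per G/C k-mer (2^8 = 256 entries for k = 8).
    index: Dict[str, Set[int]] = {
        "".join(letters): set() for letters in product("gc", repeat=k)
    }
    for region_id, region in enumerate(gc_regions):
        region = region.lower()
        # Step 2: walk the region with a sliding window of width k.
        for start in range(len(region) - k + 1):
            kmer = region[start:start + k]
            # Step 3: tag the region under each k-mer it contains.
            if kmer in index:
                index[kmer].add(region_id)
    return index

if __name__ == "__main__":
    index = index_by_kmer(["gcgcgcgcg", "gggggggg"])
    print(len(index))                 # 256
    print(sorted(index["gcgcgcgc"]))  # [0]
    print(sorted(index["gggggggg"]))  # [1]
```

Regions that share many 8-mers end up under the same keys, which gives a simple starting point for grouping them.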

Ron Gejman
that sounds like an approach I should try :)
What exactly are you trying to find out?
Ron Gejman