I have a space delimited tabular file that looks like this:

>NODE 28 length 23 cov 11.043478 ACATCCCGTTACGGTGAGCCGAAAGACCTTATGTATTTTGTGG
>NODE 32 length 21 cov 13.857142 ACAGATGTCATGAAGAGGGCATAGGCGTTATCCTTGACTGG
>NODE 33 length 28 cov 14.035714 TAGGCGTTATCCTTGACTGGGTTCCTGCCCACTTCCCGAAGGACGCAC

How can I use Unix sort to sort it by length of DNA sequence [ATCG]?

+5  A: 

If the length is in the 4th column, sort -n -k4 should do the trick.

If the sort has to work out the sequence length itself, then you're looking for a preprocessing step before sort. Perhaps a Python script that prints the length of the 7th space-separated column as an extra first or last column, which sort can then use as its key.
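A minimal sketch of such a preprocessing step, assuming the sequence is always the last space-separated field on each line. The function name `prepend_length` and the script name `add_len.py` are made up for illustration.

```python
import sys


def prepend_length(lines):
    """Yield each non-blank line prefixed with the length of its last field."""
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split()
        if not fields:
            # Skip blank lines; they carry no sequence.
            continue
        # Assumption: the DNA sequence is the last space-separated field.
        yield f"{len(fields[-1])} {line}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read the file named on the command line and print the decorated lines.
    with open(sys.argv[1]) as handle:
        for decorated in prepend_length(handle):
            print(decorated)
```

It could then be used as `python add_len.py file | sort -n -k1,1 | cut -d' ' -f2-`, where cut strips the length column again after sorting.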

Slartibartfast
+2  A: 

This pipeline will work out the length as well. My Unix is a bit rusty; I have been doing other things for a while.

$ awk '{printf("%d %s\n", length($NF), $0)}' junk.lst|sort -n -k1,1|sed 's/^[0-9]* //'
josephj1989
Wow, that's like using a nuclear warhead to kill a fly :-)
paxdiablo
+1  A: 
 awk '{print length($NF) $0|"sort -n"}' file | sed 's/^.[^>]*>/>/'
ghostdog74
+1  A: 

With Perl:

perl -e'
  print sort {
    length +($a =~ /(\S+)$/)[0] 
      <=>
    length +($b =~ /(\S+)$/)[0]
  } <>' infile

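With GNU awk (4.0 or later):

gawk '{
  n = length($NF)
  # append, so lines with equal sequence length are not overwritten
  l[n] = (n in l) ? (l[n] ORS $0) : $0
  }
END {
  # traverse the array in ascending numeric index order (replaces WHINY_USERS)
  PROCINFO["sorted_in"] = "@ind_num_asc"
  for (L in l) print l[L]
  }' infile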
radoulov