views:

409

answers:

4

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.

I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:

##
## count the number of GCs in the characters between start and stop
##
gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}

Running Rprof gives me the following output:

> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")

                   self.time self.pct total.time total.pct
"gcCount"          77.36     76.8     100.74     100.0
"=="               18.30     18.2      18.30      18.2
"strsplit"          3.58      3.6       3.64       3.6
"+"                 1.14      1.1       1.14       1.1
":"                 0.30      0.3       0.30       0.3
"as.logical"        0.04      0.0       0.04       0.0
"as.character"      0.02      0.0       0.02       0.0

$by.total
               total.time total.pct self.time self.pct
"gcCount"          100.74     100.0     77.36     76.8
"=="                18.30      18.2     18.30     18.2
"strsplit"           3.64       3.6      3.58      3.6
"+"                  1.14       1.1      1.14      1.1
":"                  0.30       0.3      0.30      0.3
"as.logical"         0.04       0.0      0.04      0.0
"as.character"       0.02       0.0      0.02      0.0

$sampling.time
[1] 100.74

Any advice for making this code faster?

+2  A: 

There's no need to use a loop here.

Try this:

gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Shane
Voted up.Better than my answer (it would have been: do it in BioPerl :-) )
Thrawn
Thanks a lot. This is roughly 4x faster, which just barely beats the function I built around Rajarshi's code. You can tell that I'm still learning R - it's hard to break out of that loop-centric thinking that I've been using for so many years.
chrisamiller
Another thing that you might try: `tolower(chars) %in% c("g", "c")`. Not sure which is faster, although I suspect that the OR `|` operator is faster than `%in%`.
Shane
+1  A: 

A one liner:

table(strsplit(toupper(a), '')[[1]])
rguha
+7  A: 

Better to not split at all, just count the matches:

gcCount2 <-  function(line, st, sp){
  length(gregexpr('[GCgc]', substr(line, st, sp))[[1]])
}

That's an order of magnitude faster.

A small C function that just iterates over the characters would be yet another order of magnitude faster.

Ken Williams
Even better (~7x faster). Thanks!
chrisamiller
+2  A: 

I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).

GC content would be:

mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
    GC(mysequence)  # 0.4761905

That's from a string, you can also read in a fasta file using "read.fasta()".

neilfws