I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps probes to something in descriptions, if that something is there. Here's how I'm going about this:

probe2gene <- list()
for (i in 1:length(probes)){
 strings<-strsplit(descriptions[i], '//')
 if (length(strings[[1]]) > 1){ 
  probe2gene[probes[i]] = strings[[1]][2]
 }
}

This works fine, but seems slow, much slower than the roughly equivalent Python:

probe2gene = {}
for p,d in zip(probes, descriptions):
    try:
        probe2gene[p] = d.split('//')[1]
    except IndexError:
        pass

My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?

Edit: a typical good "description" looks like this:

"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"

A bad "description" looks like this:

"-----"

though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probes and descriptions vectors are the same length and correspond to each other exactly, i.e. probes[i] maps to descriptions[i].

+2  A: 

It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.

> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"

Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
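If the list-as-map feels too ugly, one alternative (not part of the answer above, just a sketch) is to use an environment as a string-keyed hash table. The sample data here is made up, and the keys still have to be strings, so the probes are converted with as.character:

```r
# Sketch only: an environment used as a string-keyed lookup table.
descriptions <- c("foo//bar", "", "baz//qux")   # made-up sample data
probes <- c(10, 20, 30)

# Second field of each description; NA where there is no second field.
genes <- sapply(strsplit(descriptions, "//", fixed = TRUE), `[`, 2)
keep <- !is.na(genes)

# list2env() turns a named list into a new environment.
probe2gene <- list2env(setNames(as.list(genes[keep]),
                                as.character(probes[keep])))

probe2gene[["10"]]                                   # "bar"
exists("20", envir = probe2gene, inherits = FALSE)   # FALSE
sort(ls(probe2gene))                                 # "10" "30"
```

Whether this is actually faster than a named list for 30K entries is something you would need to measure on your own data.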

Johann Hibschman
Thanks! That is a lot faster. I hadn't realised things like "strsplit" could be applied to whole vectors. Neat!
Mike Dewar
+1  A: 

If I understand correctly, you are looking to save each probe-description combination where there is more than one (split) value in the description?

Probe and Description are the same length?

This is kind of messy but a quick first pass at it?

a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))

names(b) <- a
matches <- which(sapply(b, length) > 1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only

That's my first attempt. If you have a sample dataset that would be very useful.

Best regards,

Jay

Jay
It's hard to be the first responder ;)
Jay
A: 

Another way.

probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA      NA      "hair" 

In the sapply, we take advantage of the fact that in R the subsetting operator is also a function named '[', to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives an NA value. On the left-hand side of the same line, we use the fact that we can assign to a vector of indices in any order and with gaps.
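To make those two facts concrete, here is a small sketch using the same sample data. The second half is a variation, not the method above: it keys the result by probe name rather than by position, which avoids the NA padding if the probe numbers are large or sparse.

```r
x <- c("red", "hair")
x[2]          # "hair"
`[`(x, 2)     # the same subsetting, written as an ordinary function call
x[5]          # NA: an out-of-range index gives NA rather than an error

# Variation: name the results by probe instead of indexing by position.
probe <- c(4, 3, 1)
gene <- c("red//hair", "strange", "blue//blood")
second <- sapply(strsplit(gene, "//", fixed = TRUE), `[`, 2)
probe2gene <- setNames(second, probe)          # names become "4", "3", "1"
probe2gene <- probe2gene[!is.na(probe2gene)]   # drop probes with no second field
probe2gene[["4"]]    # "hair"
names(probe2gene)    # "4" "1"
```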

Jyotirmoy Bhattacharya
A: 

Here's another approach that should be fast. Note that this doesn't remove the empty descriptions. It could be adapted to do that, or you could clean those in a post-processing step using lapply. Is it the case that you'll never have a valid description of length one?

make_desc <- function(n)
{
    word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
    if (runif(1) < 0.70)
        paste(sapply(seq_len(n), word), collapse = "//")
    else
        "----"
}

description <- sapply(seq_len(10), make_desc)
probes <- seq_len(length(description))

desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)


> description
 [1] "fmbec"                                                               
 [2] "----"                                                                
 [3] "----"                                                                
 [4] "frrii//yjxsa//wvkce//xbpkc"                                          
 [5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"                                   
 [6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"                            
 [7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"                     
 [8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"              
 [9] "----"                                                                
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"


> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
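The post-processing step mentioned above could look roughly like this. It is a sketch on a small fixed input rather than the random data, and it assumes (as in the question) that only descriptions with more than one field are useful and that the second field is the one wanted:

```r
# Small deterministic stand-in for the random descriptions.
description <- c("fmbec", "----", "frrii//yjxsa//wvkce")
probes <- seq_along(description)

desc_parts <- strsplit(description, "//", fixed = TRUE)
ans <- split(unlist(desc_parts), rep(probes, lengths(desc_parts)))

# Keep only probes whose description split into more than one field,
# then take the second field of each.
has_second <- lengths(ans) > 1
probe2gene <- lapply(ans[has_second], `[`, 2)

names(probe2gene)    # "3"
probe2gene[["3"]]    # "yjxsa"
```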
seth