When writing functions that use strsplit, vector inputs do not behave as desired, and sapply has to be used. This is because strsplit produces a list as output. Is there a way to vectorize the process, that is, to have the function produce the correct element of the list for each element of the input vector?

For example, to count the lengths of words in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant element of the input vector.
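(For what it's worth, a direct way to express this "apply length to each list element" pattern, in base R 3.2 and later, is the lengths() helper; a minimal sketch:)

```r
words <- c("a", "quick", "brown", "fox")

# lengths() returns length() of each element of a list,
# so it vectorizes over the list that strsplit produces
lengths(strsplit(words, ""))
# [1] 1 5 5 3
```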

+7  A: 

In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead:

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
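(A type-stable variant of the same idea, not mentioned in the original answer, is vapply, which checks each result against a template value:)

```r
words <- c("a", "quick", "brown", "fox")

# vapply is like sapply, but verifies that each result
# matches the template integer(1), i.e. a single integer
vapply(strsplit(words, ""), length, integer(1))
# [1] 1 5 5 3
```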

Or else use an l*ply family function from plyr. For instance:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that we have all the words, we can do our counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03 

The vectorized function and lapply are considerably faster than the original sapply version. All solutions return the same answer (as seen by the summary output).

Apparently the latest version of plyr is faster (this is using a slightly older version).

Shane
Thanks Shane, but I'm not getting the same results with what I'm doing. It's an implementation of the Verhoeff check-digit scheme. I've modified my function to be compatible with the above implementations, but with an input vector of length 100,000, I'm getting a list of 8 elements from the first and a vector of 8 elements from the second (8 is the most likely length of the vector elements).
James
@James: Then I would imagine that there must be something else going on with your function. As you can see above, I just tested this on a vector with over 270k records and got the same results from each. You might try providing more of your code or else providing some of your data.
Shane
Incidentally, I just installed plyr version 0.1.9 in R 2.11.1 and had similar timings as in the above.
Shane
@Shane: Yes, I mistakenly indexed the list when I called it. It works now, but the timings for lapply are not much better than sapply. The algorithm needs to work through the split digits in order, so maybe that is causing the problem.
James
Marking the answer correct as it fits the example perfectly.
James
@James: Yes, you shouldn't expect much of a performance difference between any of the `apply` functions or `for`, etc. (see this question for example: http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar/2276001#2276001). The real performance improvement would come from changing your approach from iteration to a vectorized function (as in the nchar example). Feel free to post the algorithm as a separate question for optimization.
Shane
@Shane: That's not entirely correct. lapply and sapply can actually be optimized in some cases. apply() is generally fast. And if you use sapply the way lapply is being used here, you can get the performance much closer. I timed a 'for' loop for this and it's close to the sapply used here, but if you rewrite the sapply call like the current lapply version, the sapply is twice as fast as 'for'. (All of the plyr routines are incomprehensibly slower than even the for loop.)
John
@John: I might look at the `for` example because in my experience the performance is pretty close. But my message is simple: using various different `apply` functions, etc. can have a marginal gain in performance, but major improvements can be had through vectorization.
Shane
The plyr slowness is fixed in the devel version - but plyr is generally more useful when dealing with more complex problems where the times of individual applications dominates.
hadley