When writing functions that use strsplit, vector inputs do not behave as desired, and sapply has to be used. This is because strsplit produces a list as output. Is there a way to vectorize the process, that is, to have the function produce the correct element of the list for each element of the input vector?

For example, to count the lengths of words in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant element of the input vector.
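(For what it's worth, a direct way to express this "apply length to each list element" pattern, in base R 3.2 and later, is the lengths() helper; a minimal sketch:)

```r
words <- c("a", "quick", "brown", "fox")

# lengths() returns length() of each element of a list,
# so it vectorizes over the list that strsplit produces
lengths(strsplit(words, ""))
# [1] 1 5 5 3
```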

+7  A: 

In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead:

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
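(A type-stable variant of the same idea, not mentioned in the original answer, is vapply, which checks each result against a template value:)

```r
words <- c("a", "quick", "brown", "fox")

# vapply is like sapply, but verifies that each result
# matches the template integer(1), i.e. a single integer
vapply(strsplit(words, ""), length, integer(1))
# [1] 1 5 5 3
```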

Or else use an l*ply family function from plyr. For instance:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that we have all the words, we can do our counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03 

The vectorized function and lapply are considerably faster than the original sapply version. All solutions return the same answer (as seen by the summary output).

Apparently the latest version of plyr is faster (this is using a slightly older version).

Shane
Thanks Shane, but I'm not getting the same results with what I'm doing. It's an implementation of the Verhoeff check-digit scheme. I've modified my function to be compatible with the above implementations, but with an input vector of length 100,000, I'm getting a list of 8 elements from the first and a vector of 8 elements from the second (8 is the most likely length of the vector elements).
James
@James: Then I would imagine that there must be something else going on with your function. As you can see above, I just tested this on a vector with over 270k records and got the same results from each. You might try providing more of your code or else providing some of your data.
Shane
Incidentally, I just installed plyr version 0.1.9 in R 2.11.1 and had similar timings as in the above.
Shane
@Shane: Yes, I mistakenly indexed the list when I called it. It works now, but the timings for lapply are not much better than sapply. The algorithm needs to work through the split digits in order, so maybe that is causing the problem.
James
Marking the answer correct as it fits the example perfectly.
James
@James: Yes, you shouldn't expect much of a performance difference between any of the `apply` functions or `for`, etc. (see this question for example: http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar/2276001#2276001). The real performance improvement would come from changing your approach from iteration to a vectorized function (as in the nchar example). Feel free to post the algorithm as a separate question for optimization.
Shane
@Shane: That's not entirely correct. lapply and sapply can actually be optimized in some cases. apply() is generally fast. And if you use sapply the way lapply is being used here, you can get the performance much closer. I timed a 'for' loop for this and it's close to the sapply used here, but if you rewrite the sapply call like the current lapply version, the sapply is twice as fast as 'for'. (All of the plyr routines are incomprehensibly slower than even the for loop.)
John
@John: I might look at the `for` example because in my experience the performance is pretty close. But my message is simple: using various different `apply` functions, etc. can have a marginal gain in performance, but major improvements can be had through vectorization.
Shane
The plyr slowness is fixed in the devel version - but plyr is generally more useful when dealing with more complex problems where the times of individual applications dominates.
hadley