views:

58

answers:

2

I have written the following function to calculate a check digit in R.

verhoeffCheck <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)

## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}

#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)

digs <- rev(digs)   ## right to left algorithm

## tables required for D_5 group

d5_mult <- matrix(c(
                 0:9,
                 c(1:4,0,6:9,5),
                 c(2:4,0:1,7:9,5:6),
                 c(3:4,0:2,8:9,5:7),
                 c(4,0:3,9,5:8),
                 c(5,9:6,0,4:1),
                 c(6:5,9:7,1:0,4:2),
                 c(7:5,9:8,2:0,4:3),
                 c(8:5,9,3:0,4),
                 9:0
                 ),10,10,byrow=T)

d5_perm <- matrix(c(
                 0:9,
                 c(1,5,7,6,2,8,3,0,9,4),
                 c(5,8,0,3,7,9,6,1,4,2),
                 c(8,9,1,6,0,4,3,5,2,7),
                 c(9,4,5,3,1,2,6,8,7,0),
                 c(4,2,8,6,5,7,3,9,0,1),
                 c(2,7,9,3,8,0,6,4,1,5),
                 c(7,0,4,6,9,1,3,2,5,8)
                 ),8,10,byrow=T)

d5_inv <- c(0,4:1,5:9)

## apply algoritm - note 1-based indexing in R
d <- 0

for (i in 1:length(digs)){
    d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
    }

d5_inv[d+1]
}

In order to run on a vector of strings, sapply must be used. This is in part because of the use of strsplit, which returns a list of vectors. This does impact on the performance even for only moderately sized inputs.

How could this function be vectorized?

I am also aware that some performance is lost in having to create the tables in each iteration. Would storing these in a new environment be a better solution?

+1  A: 

If your input strings can contain different numbers of characters, then I don't see any way round lapply calls (or a plyr equivalent). The trick is to move them inside the function, so verhoeffCheck can accept vector inputs. This way you only need to create the matrices once.

verhoeffCheckNew <- function(x)
{
## calculates check digit based on Verhoeff algorithm

## check for string since leading zeros with numbers will be lost
  if (!is.character(x)) stop("Must enter a string")

  #split and convert to numbers
  digs <- strsplit(x, "")
  digs <- lapply(digs, function(x) rev(as.numeric(x)))

  ## tables required for D_5 group
  d5_mult <- matrix(c(
                   0:9,
                   c(1:4,0,6:9,5),
                   c(2:4,0:1,7:9,5:6),
                   c(3:4,0:2,8:9,5:7),
                   c(4,0:3,9,5:8),
                   c(5,9:6,0,4:1),
                   c(6:5,9:7,1:0,4:2),
                   c(7:5,9:8,2:0,4:3),
                   c(8:5,9,3:0,4),
                   9:0
                   ),10,10,byrow=T)

  d5_perm <- matrix(c(
                   0:9,
                   c(1,5,7,6,2,8,3,0,9,4),
                   c(5,8,0,3,7,9,6,1,4,2),
                   c(8,9,1,6,0,4,3,5,2,7),
                   c(9,4,5,3,1,2,6,8,7,0),
                   c(4,2,8,6,5,7,3,9,0,1),
                   c(2,7,9,3,8,0,6,4,1,5),
                   c(7,0,4,6,9,1,3,2,5,8)
                   ),8,10,byrow=T)

  d5_inv <- c(0,4:1,5:9)

  ## apply algorithm - note 1-based indexing in R      
  sapply(digs, function(x)
  {
    d <- 0  
    for (i in 1:length(x)){
        d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, x[i] + 1]) + 1]
        }  
    d5_inv[d+1]
  })
}

Since d depends on what it was previously, the is no easy way to vectorise the for loop.

My version runs in about half the time for 1e5 strings.

rand_string <- function(n = 12) 
{
  paste(sample(as.character(0:9), sample(n), replace = TRUE), collapse = "")
}
big_test <- replicate(1e5, rand_string())

tic()
res1 <- unname(sapply(big_test, verhoeffCheck))
toc()

tic()
res2 <- verhoeffCheckNew(big_test)
toc()

identical(res1, res2) #hopefully TRUE!

See this question for tic and toc.

Further thoughts:

You may want additional input checking for "" and other strings that return NA when converted in numeric.

Since you are dealing exclusively with integers, you may get a slight performance benefit from using them rather than doubles. (Use as.integer rather than as.numeric and append L to the values in your matrices.)

Richie Cotton
Very Nice! I am finding a similar speedup. Wrapping the final `lapply` in an `as.numeric` ensures that a vector is returned rather than a list.
James
@James: Using `sapply` rather than `lapply` will do this for you (without needing `as.numeric`).
Richie Cotton
A: 

Richie C answered the vectorisation question nicely; as for only creatig the tables once without cluttering the global name space, one quick solution that does not require a package is

verhoeffCheck <- local(function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs)   ## right to left algorithm
## apply algoritm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
    d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
    }
d5_inv[d+1]
})

assign("d5_mult", matrix(c(
    0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
    c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
    c(8:5,9,3:0,4), 9:0), 10, 10, byrow = TRUE), 
    envir = environment(verhoeffCheck))

assign("d5_perm", matrix(c(
    0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
    c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
    c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8)), 8, 10, byrow = TRUE),
    envir = environment(verhoeffCheck))

assign("d5_inv", c(0,4:1,5:9), envir = environment(verhoeffCheck))
## Now just use the function

which keeps the data in the environment of the function. You can time it to see how much faster it is.

Hope this helps.

Allan

Allan Engelhardt