tags:

views:

105

answers:

3

I'm trying to run apply a function to each row of a dataset. The function looks up matching rows in a second dataset and computes a similarity score for the product details passed to it.

The function works if I just call it with test numbers but I can't figure out how to run it on all rows of my dataset. I've tried using apply but can't get it working.

I'm going to be iterating different parameter settings to find those that best fit historical data so speed is important... meaning that a loop is out. Any help you can provide would be hugely appreciated.

Thanks! Alan

GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
    HeightParam <- 1/5000
        AgeParam <- 1
    Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) & !is.na(HistoricalStock$Type) & as.character(HistoricalStock$date)==as.character(SnapshotDate) & HistoricalStock$Type==Cand_Type,]

    Stock_SameType$ED <- (HeightParam*(Stock_SameType$Height - Cand_Height))^2 + (AgeParam*(Stock_SameType$Age - Cand_Age))^2

    return(sqrt(sum(Stock_SameType$ED)))

}

HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")

GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number

res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
        # returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852,  :  replacement has 4060 rows, data has 54
    # also 4 warnings, one for each variable. e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) :  longer object length is not a multiple of shorter object length

res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
    # `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261,  : replacement has 4060 rows, data has 13
    # also same 4 warnings as res1
+2  A: 

I took some liberties with your code b/c I try to vectorize vice use loops whenever I can... With the merge function, you merge the two data frames, and operate on the "columns", which allows you to use the vectorization built into R. I think this will do what you want (in the second line I'm just making sure that A and B don't have the same values for height and age so that your distance isn't always zero):

A <- B <- data.frame(date=Sys.Date()-9:0, stock=letters[1:10], type=1:10, height=1:10, age=1:10)
B$height <- B$age <- 10:1
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1/5000
age.param <- 1
temp <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
richardh
Thanks. I converted to use vectorization as suggested and it's much quicker than mapply (by a factor of 16). Plus when I'm iterating to find the right parameters, it makes more sense to do the merge once. I had to edit your code a bit and use ddply a the end to group the merged data frame.I'm building an algorithm to select the best stock to buy. This part is looking at the similarity of potential purchases to existing stock. The dates are in there because I'm running against historical data to calibrate the parameters.
alan
+1  A: 

Use mapply, the multivariate form of apply:

res1 <- mapply(GetDistanceTest, Sales$date, Sales$Type, Sales$Height, Sales$Age)
goodside
FYI, `mapply` is still a loop, just a more compact and readable one.
richardh
thanks mapply is what I was looking for, but as richardh suggested, vectorization turned out to be faster
alan
A: 

Code as per above comment:

A <- data.frame(date=rep(Sys.Date()-9:0,100), id=letters[1:10], type=floor(runif(1000, 1, 10)), height=runif(1000, 1, 100), age=runif(1000, 1, 100))
B <- data.frame(date=rep(Sys.Date()-9:0,1000), type=floor(runif(10000, 1, 10)), height=runif(10000, 1, 10), age=runif(10000, 1, 10))



AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1
age.param <- 1
AB$ClusterScore <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
Scores <- ddply(AB, c("id"), function(df)sum(df$ClusterScore))
alan