Hi all,
I have a fairly basic performance question as an R newbie. I'd like to assign a group ID to each row of a data frame, where each unique combination of certain fields gets its own ID. Here's my current approach:
> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"), st.num=c("101", "102", "105", "102", "150"), st.name=c("Main", "Elm", "Park", "Elm", "Main"))
> df
   name st.num st.name
1  Anne    101    Main
2   Bob    102     Elm
3 Chris    105    Park
4   Dan    102     Elm
5  Erin    150    Main
>
> # A function to generate a random string
> getString <- function(size=10) return(paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse=''))
>
> # Assign a random string to each unique street number + street name combination
> # (ddply comes from the plyr package)
> library(plyr)
> df <- ddply(df, c("st.num", "st.name"), function(x) transform(x, household=getString()))
> df
   name st.num st.name  household
1  Anne    101    Main 1EZWm4BQel
2   Bob    102     Elm xNaeuo50NS
3   Dan    102     Elm xNaeuo50NS
4 Chris    105    Park Ju1NZfWlva
5  Erin    150    Main G2gKAMZ1cU
While this works well for data frames with relatively few rows or a small number of groups, I run into performance problems with larger data sets (more than 100,000 rows) that have many unique groups.
Any suggestions to improve the speed of this task? Possibly with plyr's experimental idata.frame()? Or am I going about this all wrong?
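One possible direction, as a minimal sketch rather than a tested solution: ddply() calls the supplied function once per group, which is likely where the time goes when there are many groups. A vectorized alternative in base R is to paste the grouping fields into one key per row and look each key up in the vector of distinct keys with match(). The sketch assumes that the "|" separator never appears in the data and that an integer ID is an acceptable household identifier. The household.str column is a hypothetical extra that keeps the random-string form by drawing one string per distinct key (the strings are random, so they are not strictly guaranteed to be unique).

```r
# Same example data as above
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"))

# Random string generator from the original post
getString <- function(size = 10) {
  paste(sample(c(0:9, LETTERS, letters), size, replace = TRUE), collapse = "")
}

# One key per row; "|" is assumed not to occur in either field
key <- paste(df$st.num, df$st.name, sep = "|")

# Integer group ID, numbered in order of first appearance
distinct.keys <- unique(key)
df$household <- match(key, distinct.keys)

# Optional: one random string per distinct key, mapped back to the rows
codes <- vapply(seq_along(distinct.keys), function(i) getString(), character(1))
df$household.str <- codes[df$household]
```

Unlike the ddply() version, this leaves the rows in their original order. Here getString() is called once per distinct key, and the rest of the work is done by vectorized calls to paste(), unique() and match().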
Thanks in advance for your help.