I am using the plyr package in R to do the following:

  • pick up a row from table A according to column A and column B
  • find the row from table B having the same value in column A and column B
  • copy column C from table B to table A

I have added a progress bar to show the progress, but after it reaches 100% the call still seems to be running: my CPU is still occupied by RGUI, and it just doesn't end.

My table A has about 40000 rows of data with unique combinations of column A and column B.

I suspect that the "combine" part of plyr's "split-apply-combine" workflow cannot handle these 40000 rows of data, because the same code works for another table with 4000 rows.

Any suggestions for improving the efficiency? Thanks.

UPDATE

Here is my code:

library(plyr)

for (loop.filename in seq_len(nrow(filename))) {
  print("infection source merge")
  print(filename[loop.filename, "table_name"])
  temp <- get(filename[loop.filename, "table_name"])
  # For each HOSP_NO/REF_DATE group, look up the matching Case_Definition in abcde
  temp1 <- ddply(temp,
                 c("HOSP_NO", "REF_DATE"),
                 function(df) {
                   temp.infection.source <- abcde[abcde[, "Case_Number"] == unique(df[, "HOSP_NO"]) &
                                                  abcde[, "Reference_Date"] == unique(df[, "REF_DATE"]),
                                                  "Case_Definition"]
                   if (length(temp.infection.source) == 0) {
                     # no matching row in abcde
                     temp.infection.source <- "NIL"
                   } else if (length(unique(temp.infection.source)) > 1) {
                     # several distinct definitions for the same key
                     temp.infection.source <- "MULTIPLE"
                   } else {
                     temp.infection.source <- unique(temp.infection.source)
                   }
                   data.frame(df, INFECTION_SOURCE = temp.infection.source)
                 },
                 .progress = "text")
  assign(filename[loop.filename, "table_name"], temp1)
}
+2  A: 

If I understood correctly what you're trying to achieve, this should do what you want quite quickly and without using much memory.

#toy data
A <- data.frame(
    A=letters[1:10],
    B=letters[11:20],
    CC=1:10
)

ord <- sample(1:10)
B <- data.frame(
    A=letters[1:10][ord],
    B=letters[11:20][ord],
    CC=(1:10)[ord]
)
#combining values
A.comb <- paste(A$A,A$B,sep="-")
B.comb <- paste(B$A,B$B,sep="-")
#matching
A$DD <- B$CC[match(A.comb,B.comb)]
A

This applies only if the combinations are unique. If they're not, you'll have to take care of that first. Without the data it's quite impossible to know what you're trying to achieve exactly in your complete function, but you should be able to port the logic given here to your own case.
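
When the key combinations in B are not unique, one possible way to "take care of that first" is to collapse B to a single value per key before matching. The sketch below is only an illustration under assumptions: the toy data and the "NIL"/"MULTIPLE" labels are borrowed from the question's code, and the names B.key, A.key and collapsed are made up for this example.

```r
# Hypothetical toy data: B holds duplicate A/B keys, some with differing CC values
A <- data.frame(A = c("a", "b", "c"), B = c("k", "l", "m"),
                stringsAsFactors = FALSE)
B <- data.frame(A = c("a", "a", "b", "b"), B = c("k", "k", "l", "l"),
                CC = c("x", "y", "z", "z"), stringsAsFactors = FALSE)

# Collapse B to one value per key: keep a single distinct value,
# otherwise flag the key as "MULTIPLE"
B.key <- paste(B$A, B$B, sep = "-")
collapsed <- tapply(B$CC, B.key, function(v) {
  u <- unique(v)
  if (length(u) > 1) "MULTIPLE" else u
})

# Match against the collapsed keys; keys absent from B become "NIL"
A.key <- paste(A$A, A$B, sep = "-")
A$DD <- as.vector(collapsed)[match(A.key, names(collapsed))]
A$DD[is.na(A$DD)] <- "NIL"
A
```

The collapsing step runs once over B instead of once per group of A, which is where most of the time is saved compared with subsetting inside ddply.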

Joris Meys
Thanks for the code. I tried it, but unfortunately my table B is not like yours: it has duplicate A-B combinations with different values of C. The series of conditionals in the middle of my ddply function is there to deal with this issue. It also seems that match only returns the first matching item. Thanks anyway.
lokheart
I used your method with some tweaking: I created a unique table B before matching it against table A, and it works! Thanks!
lokheart
@lokheart: You could also do something like in this question: http://stackoverflow.com/questions/3990155/r-sort-multiple-columns-by-another-data-frame/3990529#3990529 It's a similar problem, and the solutions there might give you more to work with if you want to tweak it further.
Joris Meys
Better to use merge or join than pasting together strings to use match.
hadley
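
For illustration, a merge-based version of the lookup might look like the following. This is a minimal sketch on made-up toy data, assuming the key columns are named A and B in both tables and that B has at most one row per key; plyr's join(A, B, by = c("A", "B"), type = "left") should be a close equivalent.

```r
# Hypothetical toy data with the same key columns in both tables
A <- data.frame(A = c("a", "b", "c"), B = c("k", "l", "m"),
                stringsAsFactors = FALSE)
B <- data.frame(A = c("c", "a"), B = c("m", "k"), CC = c(3L, 1L),
                stringsAsFactors = FALSE)

# Left join on both key columns; rows of A with no match in B get NA in CC
merged <- merge(A, B, by = c("A", "B"), all.x = TRUE)
merged
```

Note that merge sorts the result by the key columns by default, so the row order of A is not necessarily preserved.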