I am trying to merge several data.frames into one data.frame. Since I have a whole list of files I am trying to do it with a loop structure.

So far the loop approach works fine. However, it looks pretty inefficient and I am wondering if there is a faster and easier approach.

Here is the scenario: I have a directory with several .csv files. Each file contains the same identifier columns, which can be used as the merge key. Since the files are rather large, I decided to read them into R one at a time instead of all at once. So I get all the files in the directory with list.files and read in the first two. Afterwards I use merge to combine them into one data.frame.

FileNames <- list.files(path=".../tempDataFolder/")
# list.files() returns bare file names here, so the folder is pasted back on.
FirstFile <- read.csv(file=paste(".../tempDataFolder/", FileNames[1], sep=""),
             header=TRUE, na.strings="NULL")
SecondFile <- read.csv(file=paste(".../tempDataFolder/", FileNames[2], sep=""),
              header=TRUE, na.strings="NULL")
# all=TRUE keeps rows that appear in only one of the two files (full outer join).
dataMerge <- merge(FirstFile, SecondFile, by=c("COUNTRYNAME", "COUNTRYCODE", "Year"),
             all=TRUE)

Now I use a for loop to get all the remaining .csv files and merge them into the already existing data.frame:

# Assumes at least three files: with fewer, 3:length(FileNames) counts downwards.
for(i in 3:length(FileNames)){
  ReadInMerge <- read.csv(file=paste(".../tempDataFolder/", FileNames[i], sep=""),
                 header=TRUE, na.strings="NULL")
  dataMerge <- merge(dataMerge, ReadInMerge, by=c("COUNTRYNAME", "COUNTRYCODE", "Year"),
               all=TRUE)
}

Even though this works just fine, I was wondering whether there is a more elegant way to get the job done.

+12  A: 

You may want to look at the closely related question on Stack Overflow.

I would approach this in two steps: import all the data (with plyr), then merge it together:

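filenames <- list.files(path=".../tempDataFolder/", full.names=TRUE)
library(plyr)
# Extra arguments are passed on to read.csv; na.strings matches the question's files.
import.list <- llply(filenames, read.csv, na.strings="NULL")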

That will give you a list of all the files that you now need to merge together. There are many ways to do this, but here's one approach (with Reduce):

data <- Reduce(function(x, y) merge(x, y, all=TRUE, 
    by=c("COUNTRYNAME", "COUNTRYCODE", "Year")), import.list, accumulate=FALSE)
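To make the fold concrete, here is a minimal sketch using three small in-memory data frames in place of the imported files. The data frames (`gdp`, `pop`, `co2`) and their values are made up for illustration; only the key columns come from the question.

```r
# Three small stand-ins for the imported files (made-up values).
key <- c("COUNTRYNAME", "COUNTRYCODE", "Year")
gdp <- data.frame(COUNTRYNAME = c("Austria", "Belgium"),
                  COUNTRYCODE = c("AUT", "BEL"),
                  Year = c(2000, 2000),
                  gdp = c(1.0, 2.0))
pop <- data.frame(COUNTRYNAME = c("Austria", "Chile"),
                  COUNTRYCODE = c("AUT", "CHL"),
                  Year = c(2000, 2000),
                  pop = c(8.0, 15.0))
co2 <- data.frame(COUNTRYNAME = "Belgium",
                  COUNTRYCODE = "BEL",
                  Year = 2000,
                  co2 = 11.5)
import.list <- list(gdp, pop, co2)

# Reduce() folds the list pairwise from the left:
# merge(merge(gdp, pop), co2), keeping unmatched rows (all = TRUE).
merged <- Reduce(function(x, y) merge(x, y, by = key, all = TRUE), import.list)
print(merged)
```

Each country that appears in any input ends up as one row, with NA in the columns contributed by files that did not contain it.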

Alternatively, you can do this with the reshape package if you aren't comfortable with Reduce:

library(reshape)
data <- merge_recurse(import.list)
Shane
@shane: I like the approach using the `reshape` package. I'll have to take another look at `plyr` and `reshape`. Thanks! One small thing: in the first line of code, `full.names=TRUE` has to be added.
mropa
Thanks; corrected that.
Shane
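The `full.names=TRUE` point from the comments can be checked with a small sketch. It writes two tiny CSV files into a scratch folder under `tempdir()` (the folder name, file names, and contents are hypothetical) and uses base R's `lapply` in place of `llply`, so no extra package is needed.

```r
# Hypothetical scratch directory with two tiny CSV files.
dir <- file.path(tempdir(), "tempDataFolder_demo")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(Year = 2000, gdp = 1), file.path(dir, "a.csv"),
          row.names = FALSE)
write.csv(data.frame(Year = 2000, pop = 8), file.path(dir, "b.csv"),
          row.names = FALSE)

# Without full.names, only the file names come back, not their location.
bare <- list.files(path = dir, pattern = "\\.csv$")
# With full.names = TRUE, each entry includes the directory.
full <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
print(bare)
print(full)

# read.csv() resolves relative names against the working directory,
# so only the full paths are safe to pass on directly.
import.list <- lapply(full, read.csv, na.strings = "NULL")
print(length(import.list))
```

The `pattern` argument is an addition here; it restricts the listing to .csv files in case the folder holds anything else.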
+1  A: 

If I'm not mistaken, a pretty simple change could eliminate the 3:length(FileNames) kludge:

FileNames <- list.files(path=".../tempDataFolder/", full.names=TRUE)
dataMerge <- data.frame()
for(f in FileNames){ 
  ReadInMerge <- read.csv(file=f, header=T, na.strings="NULL")
  dataMerge <- merge(dataMerge, ReadInMerge, 
               by=c("COUNTRYNAME", "COUNTRYCODE", "Year"), all=T)
}
Ken Williams
@ken: Since `dataMerge` is an empty `data.frame`, `merge()` cannot find a common identifier in the first iteration of the `for` loop. If I assign, e.g., the first file to `dataMerge`, it gets me back to my initial idea.
mropa
Sorry, I should have tried it first. I was thinking of rbind(), in which an empty data frame is treated as if the required columns are present but empty.
Ken Williams
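The difference discussed in this thread can be reproduced with a short sketch. The data frames are made up, and the `NULL`-seeded loop at the end is one possible workaround under that assumption, not something proposed in the thread.

```r
df <- data.frame(COUNTRYCODE = c("AUT", "BEL"),
                 Year = c(2000, 2000),
                 gdp = c(1, 2))
empty <- data.frame()

# merge() needs the key columns on both sides, so an empty seed raises an error.
# The error message is captured as a string instead of stopping the script.
merge_result <- tryCatch(
  merge(empty, df, by = c("COUNTRYCODE", "Year"), all = TRUE),
  error = function(e) conditionMessage(e)
)
print(merge_result)

# rbind() skips the zero-column data frame and returns the rows of df.
stacked <- rbind(empty, df)
print(stacked)

# Possible workaround: seed with NULL and take the first frame as-is,
# so a single loop handles every input.
frames <- list(df, data.frame(COUNTRYCODE = "AUT", Year = 2000, pop = 8))
dataMerge <- NULL
for (f in frames) {
  dataMerge <- if (is.null(dataMerge)) f else
    merge(dataMerge, f, by = c("COUNTRYCODE", "Year"), all = TRUE)
}
print(dataMerge)
```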