I have a CSV file whose first row contains the variable names and whose remaining rows contain the data. What's a good way in R to break it up into files, each containing just one variable? Will such a solution be robust? E.g. what if the input file is 100 GB in size?

The input file looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 files (or however many variables there are), var1.csv, var2.csv, and var3.csv, so that they look like the following. File1

var1
1
2
...

File2

var2
2
5
...

File3

var3
hello
yay

I got a solution in Python (http://stackoverflow.com/questions/3331608/how-to-break-a-large-csv-data-file-into-individual-data-files), but I wonder if R can do the same thing. Essentially, the Python code reads the CSV file line by line and then writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file at once, which slows the whole process down, and it can't handle a 100 GB file at all, since R would attempt to read the entire file into memory. I can't find a command in R that lets you read a CSV file line by line. Please help. Thanks!!
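
To make the goal concrete, here is a rough sketch of the kind of line-by-line processing I have in mind, using readLines on file connections. I don't know whether this is idiomatic or fast in R; the file name is just a placeholder.

## Rough sketch: read one line at a time, split on commas,
## and append each field to its own output file (named after the header).
infile <- file("file.csv", "r")
header <- strsplit(readLines(infile, n = 1), ",")[[1]]
outfiles <- lapply(paste(header, ".csv", sep = ""), function(nm) file(nm, "w"))
for (j in seq_along(header)) writeLines(header[j], outfiles[[j]])
repeat {
  line <- readLines(infile, n = 1)
  if (length(line) == 0) break            # readLines returns character(0) at end of file
  fields <- strsplit(line, ",")[[1]]
  for (j in seq_along(fields)) writeLines(fields[j], outfiles[[j]])
}
close(infile)
invisible(lapply(outfiles, close))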

+5  A: 

You can scan the file and then write out to each of the output files one line at a time.

i <- 0
## scan one line at a time; stop when scan returns fewer than two fields (end of file)
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
       length(x) > 1}) {
  write(x[1], "file1.csv", sep = ",", append = T)
  write(x[2], "file2.csv", sep = ",", append = T)
  write(x[3], "file3.csv", sep = ",", append = T)
  i <- i + 1
}

Edit: Using the above data copied 1000 times, I've compared the speed against a version that keeps the file connections open the whole time, so scan doesn't have to reopen the file and skip over the already-read lines on every iteration.

ver1 <- function() {
  i <- 0
  while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], "file1.csv", sep = ",", append = T)
    write(x[2], "file2.csv", sep = ",", append = T)
    write(x[3], "file3.csv", sep = ",", append = T)
    i <- i + 1
  }
}

system.time(ver1()) # w/ close to 3K lines of data, 3 columns
##    user  system elapsed 
##   2.809   0.417   3.629 

ver2 <- function() {
  ## keep the input and output connections open for the whole run
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], file = f1, sep = ",", append = T, ncolumns = 1)
    write(x[2], file = f2, sep = ",", append = T, ncolumns = 1)
    write(x[3], file = f3, sep = ",", append = T, ncolumns = 1)
  } 
  closeAllConnections()
}

system.time(ver2())
##   user  system elapsed 
##   0.257   0.098   0.409 
apeescape
Thanks. I will look into scan and write.
xiaodai
This one is OK, but I found it to be extremely slow. The Python example code opens the files once and then traverses through them. I think in this code scan opens the file, seeks to the read location, reads the data, then closes the file, and repeats all of that for the next line. Hence the slowness. Can R open a file like Python does, keep it open, and traverse through it? I don't think scan is doing that.
xiaodai
Right, I was thinking the same thing. This link may help: http://cran.r-project.org/doc/manuals/R-data.html#Output-to-connections
apeescape
I think it might go faster if you read bigger chunks at once. Try changing `nlines = 1` to `nlines = 1000` or `nlines = 10000`.
nullglob
You're right, you just have to be careful about identifying the end of the file.
apeescape
I think I did that by reading the entire file into a dummy variable and counting the number of rows (and removing the variable afterwards). There may be better ways, though.
Roman Luštrik
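
Following up on the chunked-reading suggestion in the comments, here is a rough, untested sketch of a third version that reads (say) 1000 lines per scan call from an open connection and stops when scan returns nothing, so there is no need to count the rows in advance. It still assumes three columns and the same placeholder file names as above.

ver3 <- function(chunk = 1000) {
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  repeat {
    ## read up to `chunk` lines' worth of fields in a single scan call
    x <- scan(f, sep = ",", nlines = chunk, what = "character", quiet = TRUE)
    if (length(x) == 0) break              # scan returns nothing once the file is exhausted
    m <- matrix(x, ncol = 3, byrow = TRUE) # one row per input line, three fields each
    write(m[, 1], file = f1, ncolumns = 1)
    write(m[, 2], file = f2, ncolumns = 1)
    write(m[, 3], file = f3, ncolumns = 1)
  }
  closeAllConnections()
}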