I have a CSV file whose first row contains the variable names and whose remaining rows contain the data. What's a good way in R to break it up into files, each containing just one variable? Will such a solution be robust? E.g. what if the input file is 100 GB in size?

The input file looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 files (or however many variables there are), var1.csv, var2.csv, and var3.csv, so that they look like the following. File1

var1
1
2
...

File2

var2
2
5
...

File3

var3
hello
yay

I got a solution in Python (http://stackoverflow.com/questions/3331608/how-to-break-a-large-csv-data-file-into-individual-data-files), but I wonder if R can do the same thing. Essentially, the Python code reads the CSV file line by line and then writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file at once, which slows the whole process down, and it can't handle a 100 GB file at all, since R would attempt to read the entire file into memory. I can't find a command in R that lets you read a CSV file line by line. Please help. Thanks!!
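
To make the goal concrete, here is a rough sketch of the kind of line-by-line processing I have in mind, using readLines on file connections. I don't know whether this is idiomatic or fast in R; the file name is just a placeholder.

## Rough sketch: read one line at a time, split on commas,
## and append each field to its own output file (named after the header).
infile <- file("file.csv", "r")
header <- strsplit(readLines(infile, n = 1), ",")[[1]]
outfiles <- lapply(paste(header, ".csv", sep = ""), function(nm) file(nm, "w"))
for (j in seq_along(header)) writeLines(header[j], outfiles[[j]])
repeat {
  line <- readLines(infile, n = 1)
  if (length(line) == 0) break            # readLines returns character(0) at end of file
  fields <- strsplit(line, ",")[[1]]
  for (j in seq_along(fields)) writeLines(fields[j], outfiles[[j]])
}
close(infile)
invisible(lapply(outfiles, close))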

+5  A: 

You can scan the file and then write out to each of the output files one line at a time.

i <- 0
## scan one line at a time; stop when scan returns fewer than two fields (end of file)
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
       length(x) > 1}) {
  write(x[1], "file1.csv", sep = ",", append = T)
  write(x[2], "file2.csv", sep = ",", append = T)
  write(x[3], "file3.csv", sep = ",", append = T)
  i <- i + 1
}

Edit: Using the above data copied 1000 times, I've compared the speed against a version that keeps the file connections open the whole time, so scan doesn't have to reopen the file and skip over the already-read lines on every iteration.

ver1 <- function() {
  i <- 0
  while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], "file1.csv", sep = ",", append = T)
    write(x[2], "file2.csv", sep = ",", append = T)
    write(x[3], "file3.csv", sep = ",", append = T)
    i <- i + 1
  }
}

system.time(ver1()) # w/ close to 3K lines of data, 3 columns
##    user  system elapsed 
##   2.809   0.417   3.629 

ver2 <- function() {
  ## keep the input and output connections open for the whole run
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], file = f1, sep = ",", append = T, ncolumns = 1)
    write(x[2], file = f2, sep = ",", append = T, ncolumns = 1)
    write(x[3], file = f3, sep = ",", append = T, ncolumns = 1)
  } 
  closeAllConnections()
}

system.time(ver2())
##   user  system elapsed 
##   0.257   0.098   0.409 
apeescape
Thanks. I will look into scan and write.
xiaodai
This one is OK, but I found it to be extremely slow. The Python example code opens the files once and then traverses through them. I think in this code scan opens the file, seeks to the read location, reads the data, then closes the file, and repeats all of that for the next line. Hence the slowness. Can R open a file like Python does, keep it open, and traverse through it? I don't think scan is doing that.
xiaodai
Right, I was thinking the same thing. This link may help: http://cran.r-project.org/doc/manuals/R-data.html#Output-to-connections
apeescape
I think it might go faster if you read bigger chunks at once. Try changing `nlines = 1` to `nlines = 1000` or `nlines = 10000`.
nullglob
You're right, you just have to be careful about identifying the end of the file.
apeescape
I think I did that by reading the entire file into a dummy variable and counting the number of rows (and removing the variable afterwards). There may be better ways, though.
Roman Luštrik
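
Following up on the chunked-reading suggestion in the comments, here is a rough, untested sketch of a third version that reads (say) 1000 lines per scan call from an open connection and stops when scan returns nothing, so there is no need to count the rows in advance. It still assumes three columns and the same placeholder file names as above.

ver3 <- function(chunk = 1000) {
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  repeat {
    ## read up to `chunk` lines' worth of fields in a single scan call
    x <- scan(f, sep = ",", nlines = chunk, what = "character", quiet = TRUE)
    if (length(x) == 0) break              # scan returns nothing once the file is exhausted
    m <- matrix(x, ncol = 3, byrow = TRUE) # one row per input line, three fields each
    write(m[, 1], file = f1, ncolumns = 1)
    write(m[, 2], file = f2, ncolumns = 1)
    write(m[, 3], file = f3, ncolumns = 1)
  }
  closeAllConnections()
}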