Hello,

I have very large tables that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and it does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or, quite possibly, a completely different approach to the problem?

Thanks,

-e

+2  A: 

A different approach to the problem could be to use the sqldf package. See in particular Example 6: File Input.

HTH

Paolo
On the right track, but JD Long's answer below provides the code to do it.
dataspora
+9  A: 

There are a few simple things to try, whether you use read.table or scan.

  1. Set nrows to the number of records in your data (nmax in scan).

  2. Make sure that comment.char="" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table.

  4. Setting multi.line=FALSE may also improve performance in scan (a call combining these options is sketched below).
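Putting those options together, a minimal sketch might look like the following (the file name, row count, and column classes are placeholders for your actual data):

# read.table with hints: known row count, comment scanning off,
# and column classes declared up front
df <- read.table('myfile', sep = '\t', nrows = 7e6, comment.char = '',
                 colClasses = c('character', 'numeric', 'numeric', 'numeric'),
                 stringsAsFactors = FALSE)

# the scan() equivalents are nmax and multi.line
datalist <- scan('myfile', sep = '\t', nmax = 7e6, multi.line = FALSE,
                 what = list(url = '', popularity = 0, mintime = 0, maxtime = 0))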

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.
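For example, on a system with standard Unix tools, you could drop unneeded columns in a shell pipeline and read only the result (a sketch; the cut field numbers are just illustrative):

# keep only columns 1 and 3 of a tab-delimited file before R ever parses it
df <- read.table(pipe('cut -f1,3 myfile'), sep = '\t',
                 colClasses = c('character', 'numeric'))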

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save, then next time you can retrieve it faster with load.
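A rough sketch of that workflow (file names are placeholders):

# pay the parsing cost once...
df <- read.table('myfile', sep = '\t',
                 colClasses = c('character', 'numeric', 'numeric', 'numeric'))
save(df, file = 'myfile.RData')   # serialize the parsed data frame

# ...then on later runs just restore it, which is much faster
load('myfile.RData')              # recreates 'df' in the workspace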

Richie Cotton
Thanks for the tips, Richie. I did a little testing, and it seems that the performance gains from using the nrows and colClasses options for read.table are quite modest. For example, reading a ~7M row table takes 78s without the options and 67s with them. (Note: the table has 1 character column and 4 integer columns, and I read it using comment.char='' and stringsAsFactors=FALSE.) Using save() and load() when possible is a great tip: once stored with save(), that same table takes only 12s to load.
eytan
+4  A: 

This was previously asked on R-Help, so that's worth reviewing.

One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see that the logic involved in readChar is much less than in read.table.
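A rough sketch of that idea, assuming a tab-delimited file with newline-terminated records and the four columns from the question:

# slurp the whole file into one string, then split it manually
raw    <- readChar('myfile', file.info('myfile')$size)
lines  <- strsplit(raw, '\n', fixed = TRUE)[[1]]
fields <- strsplit(lines, '\t', fixed = TRUE)
df <- data.frame(url        = sapply(fields, `[`, 1),
                 popularity = as.numeric(sapply(fields, `[`, 2)),
                 mintime    = as.numeric(sapply(fields, `[`, 3)),
                 maxtime    = as.numeric(sapply(fields, `[`, 4)),
                 stringsAsFactors = FALSE)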

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. Here is an example (though there is a learning curve for Hadoop):

str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
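As a sketch of the chunking idea using only base R (the chunk size and column classes are placeholders; for truly large data the map-reduce route scales better):

# read the file in fixed-size chunks over an open connection,
# so already-read lines are not rescanned on each pass
con <- file('myfile', open = 'r')
chunks <- list()
repeat {
  chunk <- tryCatch(
    read.table(con, sep = '\t', nrows = 1e6, comment.char = '',
               colClasses = c('character', 'numeric', 'numeric', 'numeric')),
    error = function(e) NULL)   # read.table errors once no lines remain
  if (is.null(chunk)) break
  chunks[[length(chunks) + 1]] <- chunk
  if (nrow(chunk) < 1e6) break
}
close(con)
df <- do.call(rbind, chunks)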

Shane
I just did a quick test, and readChar does seem to be much faster than even readLines, for some inexplicable reason. However, it is still slow as sin compared to a simple C test. At the simple task of reading 100 megs, R is about 5-10x slower than C.
Jonathan Chang
Bringing Hadoop to bear here is like bringing a cannon to a knife fight.
dataspora
Don't understand your point. The point of Hadoop is to handle very large data, which is what the question was about.
Shane
+14  A: 

I didn't see the question initially and asked a similar question a few days later. I am going to take my previous question down, but thought I'd add to this thread how I used sqldf() to do this.

There's been a little bit of discussion about the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40 million rows) of data in under 5 minutes. By contrast, the read.csv command ran all night and never completed.

Here's my test code:

Set up the test data:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)

I restarted R before running the following import routine:

library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

I let the following line run all night but it never completed:

system.time(big.df <- read.csv('bigdf.csv'))
JD Long
JD, you deserve a medal for this answer. Too many folks have banged their heads against this issue for too long.
dataspora
See also `read.csv.sql`
James
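For reference, `read.csv.sql` from the same sqldf package wraps this pattern in a single call (a sketch, reusing the file from JD Long's example above):

library(sqldf)
bigdf <- read.csv.sql('bigdf.csv', sql = 'select * from file',
                      dbname = tempfile(), header = TRUE)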