Hello,

I have very large tables that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and it does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or, quite possibly, a completely different approach to the problem?

Thanks,

-e

+2  A: 

A different approach to the problem could be to use the sqldf package. See in particular Example 6: File Input.

HTH

Paolo
On the right track, but JD Long's answer below provides the code to do it.
dataspora
+9  A: 

There are a few simple things to try, whether you use read.table or scan.

  1. Set nrows to the number of records in your data (nmax in scan).

  2. Make sure that comment.char="" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table.

  4. Setting multi.line=FALSE may also improve performance in scan (a call combining these options is sketched below).
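Putting those options together, a minimal sketch might look like the following (the file name, row count, and column classes are placeholders for your actual data):

# read.table with hints: known row count, comment scanning off,
# and column classes declared up front
df <- read.table('myfile', sep = '\t', nrows = 7e6, comment.char = '',
                 colClasses = c('character', 'numeric', 'numeric', 'numeric'),
                 stringsAsFactors = FALSE)

# the scan() equivalents are nmax and multi.line
datalist <- scan('myfile', sep = '\t', nmax = 7e6, multi.line = FALSE,
                 what = list(url = '', popularity = 0, mintime = 0, maxtime = 0))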

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.
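For example, on a system with standard Unix tools, you could drop unneeded columns in a shell pipeline and read only the result (a sketch; the cut field numbers are just illustrative):

# keep only columns 1 and 3 of a tab-delimited file before R ever parses it
df <- read.table(pipe('cut -f1,3 myfile'), sep = '\t',
                 colClasses = c('character', 'numeric'))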

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save, then next time you can retrieve it faster with load.
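A rough sketch of that workflow (file names are placeholders):

# pay the parsing cost once...
df <- read.table('myfile', sep = '\t',
                 colClasses = c('character', 'numeric', 'numeric', 'numeric'))
save(df, file = 'myfile.RData')   # serialize the parsed data frame

# ...then on later runs just restore it, which is much faster
load('myfile.RData')              # recreates 'df' in the workspace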

Richie Cotton
Thanks for the tips, Richie. I did a little testing, and it seems that the performance gains from using the nrows and colClasses options for read.table are quite modest. For example, reading a ~7M row table takes 78s without the options and 67s with them. (Note: the table has 1 character column and 4 integer columns, and I read it using comment.char='' and stringsAsFactors=FALSE.) Using save() and load() when possible is a great tip: once stored with save(), that same table takes only 12s to load.
eytan
+4  A: 

This was previously asked on R-Help, so that's worth reviewing.

One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see that the logic involved in readChar is much less than in read.table.
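A rough sketch of that idea, assuming a tab-delimited file with newline-terminated records and the four columns from the question:

# slurp the whole file into one string, then split it manually
raw    <- readChar('myfile', file.info('myfile')$size)
lines  <- strsplit(raw, '\n', fixed = TRUE)[[1]]
fields <- strsplit(lines, '\t', fixed = TRUE)
df <- data.frame(url        = sapply(fields, `[`, 1),
                 popularity = as.numeric(sapply(fields, `[`, 2)),
                 mintime    = as.numeric(sapply(fields, `[`, 3)),
                 maxtime    = as.numeric(sapply(fields, `[`, 4)),
                 stringsAsFactors = FALSE)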

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. Here is an example (though there is a learning curve for Hadoop):

str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
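As a sketch of the chunking idea using only base R (the chunk size and column classes are placeholders; for truly large data the map-reduce route scales better):

# read the file in fixed-size chunks over an open connection,
# so already-read lines are not rescanned on each pass
con <- file('myfile', open = 'r')
chunks <- list()
repeat {
  chunk <- tryCatch(
    read.table(con, sep = '\t', nrows = 1e6, comment.char = '',
               colClasses = c('character', 'numeric', 'numeric', 'numeric')),
    error = function(e) NULL)   # read.table errors once no lines remain
  if (is.null(chunk)) break
  chunks[[length(chunks) + 1]] <- chunk
  if (nrow(chunk) < 1e6) break
}
close(con)
df <- do.call(rbind, chunks)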

Shane
I just did a quick test, and readChar does seem to be much faster than even readLines, for some inexplicable reason. However, it is still slow as sin compared to a simple C test. At the simple task of reading 100 megs, R is about 5-10x slower than C.
Jonathan Chang
Bringing Hadoop to bear here is like bringing a cannon to a knife fight.
dataspora
Don't understand your point. The point of Hadoop is to handle very large data, which is what the question was about.
Shane
+14  A: 

I didn't see the question initially and asked a similar question a few days later. I am going to take my previous question down, but thought I'd add to this thread how I used sqldf() to do this.

There's been a little bit of discussion about the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40 million rows) of data in under 5 minutes. By contrast, the read.csv command ran all night and never completed.

Here's my test code:

Set up the test data:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)

I restarted R before running the following import routine:

library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

I let the following line run all night but it never completed:

system.time(big.df <- read.csv('bigdf.csv'))
JD Long
JD, you deserve a medal for this answer. Too many folks have banged their heads against this issue for too long.
dataspora
See also `read.csv.sql`
James
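For reference, `read.csv.sql` from the same sqldf package wraps this pattern in a single call (a sketch, reusing the file from JD Long's example above):

library(sqldf)
bigdf <- read.csv.sql('bigdf.csv', sql = 'select * from file',
                      dbname = tempfile(), header = TRUE)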