Possible Duplicate:
Quickly reading very large tables as dataframes in R

Hi,

While trying to read a large dataset into R, the console displayed the following errors:

> data <- read.csv("UserDailyStats.csv", sep=",", header=T, na.strings="-", stringsAsFactors=FALSE)
> data <- data[complete.cases(data),]
> dataset <- data.frame(user_id = as.character(data[,1]),
+                       event_date = as.character(data[,2]),
+                       day_of_week = as.factor(data[,3]),
+                       distinct_events_a_count = as.numeric(as.character(data[,4])),
+                       total_events_a_count = as.numeric(as.character(data[,5])),
+                       events_a_duration = as.numeric(as.character(data[,6])),
+                       distinct_events_b_count = as.numeric(as.character(data[,7])),
+                       total_events_b = as.numeric(as.character(data[,8])),
+                       events_b_duration = as.numeric(as.character(data[,9])))
Error: cannot allocate vector of size 94.3 Mb
In addition: Warning messages:
1: In data.frame(user_msisdn = as.character(data[, 1]), calls_date = as.character(data[,  :
  NAs introduced by coercion
2: In data.frame(user_msisdn = as.character(data[, 1]), calls_date = as.character(data[,  :
  NAs introduced by coercion
3: In class(value) <- "data.frame" :
  Reached total allocation of 3583Mb: see help(memory.size)
4: In class(value) <- "data.frame" :
  Reached total allocation of 3583Mb: see help(memory.size)

Does anyone know how to read large datasets like this? The size of UserDailyStats.csv is approximately 2 GB.

+5  A: 

Sure:

  1. Get a bigger computer, in particular more RAM
  2. Run a 64-bit OS; see 1) about more RAM now that you can use it
  3. Read only the columns you need
  4. Read fewer rows
  5. Read the data in binary form rather than re-parsing 2 GB of text (which is mighty inefficient); points 3 to 5 are sketched below.

There is also a manual for this, R Data Import/Export, on the R site.
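A minimal sketch of points 3 to 5 in R, reusing the file name and nine-column layout from the question; which columns to drop and how many rows to read are only placeholders:

# 3. Read only the columns you need: "NULL" in colClasses skips a column entirely.
cols <- c("character", "character", "factor", rep("numeric", 6))
cols[4:6] <- "NULL"                      # e.g. drop the three events_a columns
data <- read.csv("UserDailyStats.csv", header = TRUE, na.strings = "-",
                 colClasses = cols)

# 4. Read fewer rows: nrows caps how much is parsed in one go.
head_rows <- read.csv("UserDailyStats.csv", header = TRUE, na.strings = "-",
                      nrows = 100000, stringsAsFactors = FALSE)

# 5. After one successful parse, keep a binary copy and reload that instead of
#    re-parsing the 2 GB text file in every session.
save(data, file = "UserDailyStats.RData")
load("UserDailyStats.RData")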

Dirk Eddelbuettel
+1  A: 

You could try specifying the data type in the read.csv call using colClasses.

data <- read.csv("UserDailyStats.csv", sep=",", header=T, na.strings="-",
                 stringsAsFactors=FALSE,
                 colClasses=c("character","character","factor",rep("numeric",6)))

Though with a dataset of this size it may still be problematic, and there isn't a great deal of memory left for any analysis you may want to do. Adding RAM and using 64-bit computing would provide more flexibility.
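One common way to fill in colClasses without typing all nine by hand is to let R guess the types from a small sample of rows and reuse those classes for the full read; this is only a sketch, and it assumes the first few thousand rows are representative of the whole file:

sample <- read.csv("UserDailyStats.csv", header = TRUE, na.strings = "-",
                   nrows = 5000, stringsAsFactors = FALSE)
classes <- sapply(sample, class)
data <- read.csv("UserDailyStats.csv", header = TRUE, na.strings = "-",
                 colClasses = classes, stringsAsFactors = FALSE)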

James
+1  A: 

If this is the console output, then the data was read successfully, but there is a problem with the transformations.

If you work interactively, then after read.csv save your data with save(data, file="data.RData"), close R, start a fresh instance, load the data with load("data.RData"), and see if it still fails.

But from these error messages I can see that you have a problem with the conversion, so you should look at that.
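A rough sketch of that save/reload check, reusing the file name from the question:

data <- read.csv("UserDailyStats.csv", sep = ",", header = TRUE,
                 na.strings = "-", stringsAsFactors = FALSE)
save(data, file = "data.RData")          # binary copy of the parsed data

# ...quit R, start a fresh instance, then:
load("data.RData")                       # restores 'data' without re-parsing the CSV
data <- data[complete.cases(data), ]     # retry the transformations in the clean session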

Marek