Good morning,

My question saga continues about R.

I have been working on large datasets lately (more than 400 thousand lines).

So far, I have been using the XTS format, which worked fine for "small" datasets of a few tens of thousands of elements.

Now that the project has grown, R simply crashes when retrieving the data from the database and putting it into the XTS.
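A simplified sketch of the workflow, with an in-memory SQLite table standing in for the real database (the actual query, table and column names are different):

```r
library(DBI)
library(RSQLite)
library(xts)

## Stand-in for the real database: an in-memory table with made-up columns
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ticks",
             data.frame(timestamp = as.numeric(Sys.time()) + seq_len(400000),
                        price     = cumsum(rnorm(400000))))

## Pull everything and build the XTS object: the step where R dies
res <- dbGetQuery(con, "SELECT timestamp, price FROM ticks")
x   <- xts(res$price,
           order.by = as.POSIXct(res$timestamp, origin = "1970-01-01"))
dbDisconnect(con)
```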

It is my understanding that R should be able to handle vectors with up to 2^32-1 elements (or 2^64-1, depending on the version). Hence, I came to the conclusion that XTS might have some limitation, but I could not find the answer in the documentation (maybe I was a bit overconfident about my understanding of the theoretically possible vector size).

To sum up, I would like to know:

1) Whether XTS indeed has a size limitation.

2) What do you think is the smartest way to handle large time series?

(I was thinking about splitting the analysis into several smaller datasets).

3) I don't get an error message; R simply shuts down automatically. Is this a known behavior?

Thanks for your help,

Jeremie

SOLUTION

  1. The limit is the same as for R itself, and it depends on the build being used (32-bit or 64-bit). It is in any case extremely large.
  2. Chunking the data is indeed a good idea, but it turned out not to be needed here.
  3. The crash came from a bug in R 2.11.0 which was fixed in R 2.11.1; it affected long date vectors (here, the index of the XTS). See the version check below.
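
For anyone hitting the same crash, a quick check of the versions in use:

```r
## The fix shipped with R 2.11.1, so the running version should be at least that
R.version.string
getRversion() >= "2.11.1"

## Version of the installed xts package
packageDescription("xts")$Version
```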
+5  A: 

Regarding your first two questions, my $0.02:

  1. Yes, there is a limit of 2^31-1 elements for R vectors. This comes from the indexing logic, which reportedly sits 'deep down' enough in R that it is unlikely to be changed soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction (see the quick check after this list).

  2. Yes, splitting things into manageable chunks is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane). A sketch of one way to chunk the work follows below.
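
A quick check illustrating point 1; nothing here is xts-specific, it is just the base R limit plus a 400k-row object for scale:

```r
library(xts)

.Machine$integer.max   # 2147483647 = 2^31 - 1: the largest possible vector index

## A 400,000-row xts object is nowhere near that limit
n <- 4e5
x <- xts(rnorm(n), order.by = Sys.time() + seq_len(n))
dim(x)                 # 400000 rows, 1 column
object.size(x)         # only a few MB
```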

There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.
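
For what it is worth, a sketch of how such chunking could look; the stand-in series and the per-chunk mean() are placeholders for the real data and indicator:

```r
library(xts)

## Stand-in series with a one-minute index (the real one comes from the database)
n <- 4e5
x <- xts(cumsum(rnorm(n)), order.by = as.POSIXct("2010-01-01") + 60 * seq_len(n))

## Split into calendar-month chunks and process each one separately
chunks  <- split(x, f = "months")
results <- sapply(chunks, function(chunk) mean(coredata(chunk)))
results
```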

Dirk Eddelbuettel
Well, the thing is that I am applying some indicators to the dataset. Basically, I have to find the best parameters for these indicators, so splitting into chunks is not really good for me because it makes my analysis "discontinuous". What I could maybe do is consider only a frame and move that frame continuously over the data (like 1...10, 2...11, 3...12), re-querying the database every time.
JSmaga
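
Roughly, that moving frame is a rolling window, which can be computed in a single pass over the data instead of re-querying the database for every shift. A sketch, with a plain mean standing in for the indicator and an arbitrary window width:

```r
library(xts)
library(zoo)

## Stand-in series with a one-minute index
n <- 1e5
x <- xts(cumsum(rnorm(n)), order.by = as.POSIXct("2010-01-01") + 60 * seq_len(n))

## Frames 1..10, 2..11, 3..12, ... as a rolling window of width 10
frame_width <- 10
rolled <- rollapply(x, width = frame_width, FUN = mean, align = "right")
head(rolled, 15)
```

For a plain mean, rollmean() is much faster; rollapply() accepts an arbitrary function.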
Actually, I am noticing that if I do the computation and don't display the results in the R console, R doesn't crash. It only becomes problematic when displaying the data somehow. Good to know. Any idea why that's the case?
JSmaga
@JSMaga: could be that the `print`/`plot` function can't handle that many lines. You can't read that many anyway, so it's best to summarise it first, and only display what you need.
Richie Cotton
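
For example, along these lines (a made-up series named x stands in for the real object; with the real data only the last three lines matter):

```r
library(xts)

## Stand-in for the real 400k-row object
n <- 4e5
x <- xts(rnorm(n), order.by = as.POSIXct("2010-01-01") + 60 * seq_len(n))

head(x)               # first few rows instead of 400,000 printed lines
summary(coredata(x))  # numeric summary of the values
plot(x["2010-03"])    # display a single month rather than the whole series
```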
@Richie Cotton: indeed. Even trying to display it in R makes it crash. It should give me an error, though, instead of crashing...
JSmaga
@Richie Cotton: the bug was coming from R itself. It has been fixed in the latest version, so `print`/`plot` do handle that many points...
JSmaga