views:

262

answers:

4

Hello,

I have cell array each containing a sequence of values as a row vector. The sequences contain some missing values represented by NaN.

I would like to replace all NaNs using some sort of interpolation method, how can I can do this in MATLAB? I am also open to other suggestions on how to deal with these missing values.

Consider this sample data to illustrate the problem:

seq = {randn(1,10); randn(1,7); randn(1,8)};
for i=1:numel(seq)
    %# simulate some missing values
    ind = rand( size(seq{i}) ) < 0.2;
    seq{i}(ind) = nan;
end

The resulting sequences:

seq{1}
ans =
     -0.50782     -0.32058          NaN      -3.0292     -0.45701       1.2424          NaN      0.93373          NaN    -0.029006
seq{2}
ans =
      0.18245      -1.5651    -0.084539       1.6039     0.098348     0.041374     -0.73417
seq{3}
ans =
          NaN          NaN      0.42639     -0.37281     -0.23645       2.0237      -2.2584       2.2294

Edit:

Based on the responses, I think there's been a confusion: obviously I'm not working with random data, the code shown above is simply an example of how the data is structured.

The actual data is some form of processed signals. The problem is that during the analysis, my solution would fail if the sequences contain missing values, hence the need for filtering/interpolation (I already considered using the mean of each sequence to fill the blanks, but I am hoping for something more powerful)

A: 

As JudoWill says, you need to assume some sort of relationship between your data.

One trivial option would be to compute the mean of your total series, and use those for missing data. Another trivial option would be to take the mean of the n previous and n next values.

But be very careful with this: if you're missing data, you're generally better to deal with those missing data, than to make up some fake data that could screw up your analysis.

Kena
+3  A: 

Well, if you're working with time-series data then you can use Matlab's built in interpolation function.

Something like this should work for your situation, but you'll need to tailor it a little ... ie. if you don't have equal spaced sampling you'll need to modify the times line.

nseq = cell(size(seq))
for i = 1:numel(seq)
    times = 1:length(seq{i});
    mask =  ~isnan(seq{i});
    nseq{i} = seq{i};
    nseq{i}(~mask) = interp1(times(mask), seq{i}(mask), times(~mask));

end

You'll need to play around with the options of interp1 to figure out which ones work best for your situation.

JudoWill
thanks, in my case I need to change the `times` vector since values are recorded on a 3 seconds basis
Dave
... Now that I'm thinking about it, it doesn't matter as long as the sequences are equally sampled, isn't it?
Dave
yeah, as long as they are equally sampled it doesn't really matter ... but I try to be as explicit as possible.
JudoWill
+1  A: 

If you have access to the System Identification Toolbox, you can use the MISDATA function to estimate missing values. According to the documentation:

This command linearly interpolates missing values to estimate the first model. Then, it uses this model to estimate the missing data as parameters by minimizing the output prediction errors obtained from the reconstructed data.

Basically the algorithm alternates between estimating missing data and estimating models, in a way similar to the Expectation Maximization (EM) algorithm.

The model estimated can be any of the linear models idmodel (AR/ARX/..), or if non given, uses a default-order state-space model.

Here's how to apply it to your data:

for i=1:numel(seq)
    dat = misdata( iddata(seq{i}(:)) );
    seq{i} = dat.OutputData;
end
Amro
+2  A: 

I would use inpaint_nans, a tool designed to replace nan elements in 1-d or 2-d matrices by interpolation.

seq{1} = [-0.50782 -0.32058 NaN -3.0292 -0.45701 1.2424 NaN 0.93373 NaN -0.029006];
seq{2} = [0.18245 -1.5651 -0.084539 1.6039 0.098348 0.041374 -0.73417];
seq{3} = [NaN NaN 0.42639 -0.37281 -0.23645 2.0237];

for i = 1:3
  seq{i} = inpaint_nans(seq{i});
end

seq{:}
ans =
 -0.50782 -0.32058 -2.0724 -3.0292 -0.45701 1.2424 1.4528 0.93373 0.44482 -0.029006

ans =
  0.18245 -1.5651 -0.084539 1.6039 0.098348 0.041374 -0.73417

ans =
  2.0248 1.2256 0.42639 -0.37281 -0.23645 2.0237
woodchips
+1 thanks woodchips
Dave

related questions