tags:

views:

681

answers:

2

I have a CSV of file of data that I can load in R using read.csv()

Some of the data is missing, so I want to reduce the data frame down to set that consists entirely of non-missing data, i.e. if a NULL appears anywhere, I want to exclude that column and row from the filtered data set.

I know I can probably do this fairly simply with the inbuilt R vector operations, but I am not quite sure how to do this exactly?

To make my question a little more concrete, here is a quick sample of the data so you can see what I want to do.

DocID       Anno1    Anno7  Anno8
1           7        NULL   8
2           8        NULL   3
44          10       2      3
45          6        6      6
46          1        3      4
49          3        8      5
62          4        NULL   9
63          2        NULL   4
67          11       NULL   3
91          NULL     9      7
92          NULL     7      5
93          NULL     8      8

So given this input, I need some code that will reduce the output to this.

DocID       Anno8
44          3
45          6
46          4
49          5

As Anno8 is the only column with non-NULL data, and there are only four rows with non-NULL data.

+4  A: 

If x is your data.frame (or matrix) then

x[ ,apply(x, 2, function(z) !any(is.na(z)))]

Since your example uses NULL, is.na will be replaced by is.null

Alternatively you can look at subset

rguha
+9  A: 

You can drop any row containing a missing using na.omit(), however that's not what you want. Moreover, the currently accepted answer is wrong. It gives you complete columns, but does not drop the rows that have one or more missing values, which is what was asked for. The correct answer can be obtained as:

> a <- data.frame(a=c(1,2),b=c(NA,1), c=c(3,4))
> a
  a  b c
1 1 NA 3
2 2  1 4
> na.omit(a)[,colSums(is.na(a))==0]
  a c
2 2 4

To see that the above answer is wrong:

> a[ ,apply(a, 2, function(z) !any(is.na(z)))]
  a c
1 1 3
2 2 4

Row 1 should be dropped because of the NA in column 2.