At some point in my script I want to count the missing values in my data.frame and display them. In my case I have:

out <- read.csv(file="...../OUT.csv", na.strings="NULL")

sum(is.na(out$codeHelper))

out[is.na(out$codeHelper),c(1,length(colnames(out)))]

This works fine. However, the last command returns every row where codeHelper is NA, e.g.:

5561                  Yemen (PDR) <NA>
5562                  Yemen (PDR) <NA>
5563                  Yemen (PDR) <NA>
5564                  Yemen (PDR) <NA>
5565                  Yemen (PDR) <NA>
5566                  Yemen (PDR) <NA>
5567                  Yemen (PDR) <NA>
5568                  Yemen (PDR) <NA>
5601 Zaire (Democ Republic Congo) <NA>
5602 Zaire (Democ Republic Congo) <NA>
5603 Zaire (Democ Republic Congo) <NA>
5604 Zaire (Democ Republic Congo) <NA>
5605 Zaire (Democ Republic Congo) <NA>

With a big data frame and a lot of NAs, that looks pretty messy. All that matters to me is where the NA occurs, i.e. which country (in the second column) has a missing value in the third column.

So how can I display only a single row for each country?

It should look something like this:

    1                  Yemen (PDR) <NA>
    2 Zaire (Democ Republic Congo) <NA>
    3                          USA <NA>
    4                     W. Samoa <NA>
+3  A: 

Try something like this:

subset(dataframe.name, !duplicated(country.colname),
       select=c(col1.name, col2.name, ...))

See also this related question: how to remove partial duplicates from a data frame?
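To make the template above concrete, here is a minimal sketch on toy data. The column names `country` and `codeHelper` and the sample values are assumptions for illustration, not the asker's real file. It first keeps the rows with a missing code, then keeps the first row per country:

```r
# Toy data standing in for the real CSV; column names are assumed.
out <- data.frame(
  country    = c("Yemen (PDR)", "Yemen (PDR)",
                 "Zaire (Democ Republic Congo)", "USA", "USA", "France"),
  codeHelper = c(NA, NA, NA, NA, NA, "FR"),
  stringsAsFactors = FALSE
)

# Step 1: restrict to rows where the code is missing.
missing <- out[is.na(out$codeHelper), ]

# Step 2: keep only the first occurrence of each country.
result <- subset(missing, !duplicated(country),
                 select = c(country, codeHelper))
print(result)
```

Filtering on NA before calling `duplicated` matters here: applied to the full data frame, `!duplicated(country)` would keep the first row of every country, whether or not its code is missing.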

rcs
+3  A: 

unique(c(1,2,3,4,4))

will give you

[1] 1 2 3 4

so

unique(out[is.na(out$codeHelper),c(1,length(colnames(out)))])

should be what you're looking for?
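As a small illustration of how `unique` behaves on a data frame, here is a sketch on toy data (the column names `id`, `country` and `codeHelper` are assumed, not taken from the real file). `unique` drops rows that are identical across all selected columns, so the selection should not include a column that differs on every row, such as an id:

```r
# Toy data; names and values are hypothetical.
out <- data.frame(
  id         = 1:6,
  country    = c("Yemen (PDR)", "Yemen (PDR)",
                 "Zaire (Democ Republic Congo)", "USA", "USA", "France"),
  codeHelper = c(NA, NA, NA, NA, NA, "FR"),
  stringsAsFactors = FALSE
)

# Select only the columns that repeat, then drop duplicate rows.
na_rows <- unique(out[is.na(out$codeHelper), c("country", "codeHelper")])

# Optional: renumber the rows 1..n, as in the desired output.
rownames(na_rows) <- NULL
print(na_rows)
```

Without the `rownames(na_rows) <- NULL` line, the result keeps the original row numbers of the first occurrence of each country.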

pufferfish
perfect, i was exactly searching for such a function! thanks!
mropa