Dear all,

I'm trying to strip trailing spaces from the factor variables in a data frame. However, the levels assignment doesn't work inside my lapply function.

rm.space<-function(x){
    a<-gsub(" ","",x)
    return(a)}


lapply(names(barn),function(x){
    levels(barn[,x])<-rm.space(levels(barn[,x]))
    })

Any ideas how I can assign levels inside a lapply function?

//M

+6  A: 

R is vectorised, you do not need apply():

> f <- as.factor(sample(c("  a", " b", "c", "  d"), 10, replace=TRUE))
> levels(f)
[1] "  a" " b"  "c"   "  d"
> levels(f) <- gsub(" +", "", levels(f), perl=TRUE)
> levels(f)
[1] "a" "b" "c" "d"
> f
 [1] d a c b c d d a a a
Levels: a b c d
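One caveat worth illustrating: the pattern " +" removes every space, including spaces inside a level. If only trailing spaces should go, as the question describes, one option is to anchor the pattern at the end of the string; base R's trimws() (available since R 3.2.0) behaves similarly. A minimal sketch with made-up levels, not the OP's data:

```r
# Made-up factor whose levels carry trailing spaces and one interior space
f <- factor(c("New York ", "Oslo  ", "New York "))

# Anchor the pattern at the end of the string so interior spaces survive
levels(f) <- sub(" +$", "", levels(f))
levels(f)

# Similar result with trimws(), restricted to the right-hand side
g <- factor(c("New York ", "Oslo  "))
levels(g) <- trimws(levels(g), which = "right")
levels(g)
```

If two levels become identical after trimming, the levels assignment merges them into one.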
Dirk Eddelbuettel
True, R is vectorized, but removing spaces is not vectorized over the different variables in a data frame. Judging from his code, I believe that is what the OP needs.
Joris Meys
+1  A: 

From your code I read that the lapply is used to loop over the different variables, not over the levels of a factor. So you do need some kind of looping structure, but lapply is a bad choice:

  • you loop over a vector, names(barn), so it's better to use sapply
  • the apply family returns the result of each iteration, which you don't want here, so you are using memory for no purpose.
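To illustrate the second point, here is a minimal sketch on a small made-up data frame (`toy` is hypothetical, not the OP's `barn`). The levels assignment inside the anonymous function changes a copy that is local to that function, so lapply hands back the assigned values while the original stays as it was.

```r
# Hypothetical toy data; stringsAsFactors = TRUE is needed in R >= 4.0.0 to get a factor
toy <- data.frame(g = c(" a", " b", " a"), stringsAsFactors = TRUE)

res <- lapply(names(toy), function(x) {
    # This assignment creates a modified copy of `toy` local to the function
    levels(toy[, x]) <- gsub(" ", "", levels(toy[, x]))
})

res             # a list holding the value of each assignment
levels(toy$g)   # the levels of the original data frame are unchanged
```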

Anyway, if you need to assign to a variable in your global environment from within lapply, you need the <<- operator. Say you want to remove the spaces from a selected number of variables:

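f <- paste("", letters[1:5])   # values with a leading space: " a", " b", ...

# stringsAsFactors = TRUE is required in R >= 4.0.0 to get factor columns
Df <- data.frame(
    X1 = sample(f, 10, replace = TRUE),
    X2 = sample(f, 10, replace = TRUE),
    X3 = sample(f, 10, replace = TRUE),
    stringsAsFactors = TRUE
    )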

# Bad example :   
lapply(c("X1","X3"),function(x){
    levels(Df[,x])<<-gsub(" +","",levels(Df[,x]))
    })

gives

> str(Df)
'data.frame':   10 obs. of  3 variables:
 $ X1: Factor w/ 3 levels "a","b","c": 2 3 1 1 1 2 3 2 2 2
 $ X2: Factor w/ 5 levels " a"," b"," c",..: 4 5 4 2 5 5 1 2 5 3
 $ X3: Factor w/ 5 levels "a","b","c","d",..: 2 3 4 1 4 1 3 3 5 4

A for loop is better:

for( i in c("X1","X3")){
    levels(Df[,i])<-gsub(" +","",levels(Df[,i]))
}

This does what you need without the hassle of the <<- operator and without holding memory unnecessarily.

Joris Meys
Thx Joris. I did it with a for loop first, but encouraged by all the books stating that everything should be done within the ..apply framework, I figured I'd give it a try... I was not aware of the <<- operator. Thx again. //M
Misha
@Misha : I know for-loops are frowned upon in the R community, especially because they operate in the global environment. The apply family avoids this, making it safer to use in many cases. Yet here you want to work in your global environment, so a for-loop is the correct tool for that. See also : http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar
Joris Meys
A: 

As Joris states, lapply works on a local copy of the data.frame, so it won't modify your original data. But you could use its result to replace your data:

barn[] <- lapply(barn, function(x) {
    levels(x) <- rm.space(levels(x))
    x
    })
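A minimal sketch of the replacement above, run on a made-up `barn` (the real data is not shown in the question, so the columns here are assumptions). Assigning into `barn[]` rather than `barn` keeps the data frame structure instead of overwriting it with the plain list that lapply returns.

```r
# The OP's helper: removes every space from a character vector
rm.space <- function(x) gsub(" ", "", x)

# Hypothetical stand-in for the OP's data frame
barn <- data.frame(
    sex  = c("boy ", "girl ", "boy "),
    town = c("Oslo ", "Bergen", "Oslo "),
    stringsAsFactors = TRUE   # needed in R >= 4.0.0 to get factor columns
)

barn[] <- lapply(barn, function(x) {
    levels(x) <- rm.space(levels(x))
    x   # return the modified column
})

is.data.frame(barn)   # still a data frame
levels(barn$sex)
levels(barn$town)
```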

This is useful when your data contains different types and you want to modify only the factors, e.g.:

factors <- sapply(barn, is.factor)
barn[factors] <- lapply(barn[factors], function(x) {
    levels(x) <- rm.space(levels(x))
    x
    })
Marek