Suppose I'm messing about with some data by binding vectors together, as I'm wont to do on a lazy Sunday afternoon.

    x <- rnorm(25, mean = 65, sd = 10)
    y <- rnorm(25, mean = 75, sd = 7)
    z <- 1:25

    dd <- data.frame(mscore = x, vscore = y, caseid = z)

I've now got my new dataframe dd, which is wonderful. But there's also still the detritus from my prior slicings and dicings:

    > ls()
    [1] "dd"        "x"          "y"          "z"         

What's a simple way to clean up my workspace if I no longer need my "source" columns, but I want to keep the dataframe? That is, now that I'm done manipulating data I'd like to just have dd and none of the smaller variables that might inadvertently mask further analysis:

    > ls()
    [1] "dd"

I feel like the solution must be of the form rm(ls[ -(dd) ]) or something, but I can't quite figure out how to say "please clean up everything BUT the following objects."

+8  A: 

Here is an approach using setdiff:

    rm(list = setdiff(ls(), "dd"))
rcs
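The same idea extends to keeping more than one object. The following is only a sketch: `ws` and `keep` are hypothetical names, and `ws` is a throwaway environment standing in for the workspace so the example can be run without deleting anything real.

```r
# `ws` is a stand-in for the workspace (an assumption made for safety);
# at the console you would work with the global environment instead.
ws <- new.env()
local({
  x  <- rnorm(25, mean = 65, sd = 10)
  y  <- rnorm(25, mean = 75, sd = 7)
  z  <- 1:25
  dd <- data.frame(mscore = x, vscore = y, caseid = z)
}, envir = ws)

keep <- c("dd")                                # names to preserve (hypothetical)
rm(list = setdiff(ls(ws), keep), envir = ws)   # remove everything not in `keep`
ls(ws)
```

At the console the equivalent call would be `rm(list = setdiff(ls(), keep))`, which operates on the current workspace.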
@rcs - that's quite clever. Is my problem something common in data cleaning? I imagined it had to be, but perhaps it's because I'm a novice. Is it weird to have an interactive bit of slice n dice, followed by the need to clean up the bits that are no longer necessary?
briandk
I find that to be very common; I like having a trim workspace.
Dan
I never use `rm`. It usually doesn't matter that you have a few interim pieces lying around - if it's happening a lot, it's probably a sign that you should create a function.
hadley
I think @hadley's right. My worry was that having them lying around might mask other analyses that I'm doing. Here's an example: `caseid <- 1:25`, `height <- rnorm(25, mean = 150, sd = 15)`, `hd <- data.frame(caseid, height)`, `hd <- hd[-(7), ]` (removing a case), `library(ggplot2)`, `qplot(x = caseid, y = height, data = hd)` (plots 25 points). In the initial lines, I want to give meaningful variable names. But then those global variables seem to mask the ones local to the dataframe in my plot call. I assume this means I need to adopt a better practice?
briandk
I'm sorry that my above code didn't format correctly. I can't seem to get Markdown to cooperate :-(
briandk
Global variables don't mask local variables in qplot.
hadley
A: 

Since I forgot that comments don't support full formatting, I wanted to respond to Hadley's recommendation here. Some of my existing code--perhaps sloppily--tends to work like this:

    caseid <- 1:25
    height <- rnorm(25, mean = 150, sd = 15)
    hd     <- data.frame(caseid, height)
    hd     <- hd [-(7), ] # Removing a case
    library(ggplot2)
    qplot(x = caseid, y = height, data = hd) # Plots 25 points

In the above code, qplot() will plot 25 points, and I think it's because my global variables caseid and height are masking its attempt to access them locally from the provided dataframe. So, the case that I removed still seems to get plotted, because it appears in the global variables, though not the dataframe hd at the time of the qplot() call.

My sense is that this behavior is entirely expected, and that the answer here is that I'm following a suboptimal coding practice. So, how can I start writing code that avoids these kinds of inadvertent collisions?

briandk
Hmm... I don't think that's what's going on. For example, I made an hd2 that only had ten rows from hd. If the global caseid and height were masking in the qplot(), it wouldn't matter what the data argument takes, right? But I definitely get 10 points with hd2.
Matt Parker
1. Are you SURE you're counting correctly?
Matt Parker
2. If you're making caseid and height explicitly to go into the data.frame and never need them as their own vectors, you can make `hd <- data.frame(caseid = 1:25, height = rnorm(25, mean = 150, sd = 15))`.
Matt Parker
@Matt - those are great suggestions. Sometimes I can't generate them in situ in the data.frame command, though, so I think Fojtasek nailed a good solution. Out of curiosity, can you share your code that generates only ten datapoints?
briandk
Sure - I just tweaked your subsetting of hd: `hd <- hd[-(1:15), ]` and then ran your qplot line. Coming back to it fresh this morning, I forgot to explicitly write the `data = ` part of `data = hd`, and that did give me all 25 points, because qplot didn't know what to do with hd.
Matt Parker
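The lookup order discussed in these comments can be checked without ggplot2. This is a base-R sketch, not qplot itself: `with()` is used here as an illustrative stand-in for evaluating names inside a data frame, and `n_in_frame` and `n_global` are hypothetical names.

```r
caseid <- 1:25
height <- rnorm(25, mean = 150, sd = 15)
hd     <- data.frame(caseid, height)
hd     <- hd[-7, ]                        # remove one case

# Evaluated inside hd: the column is found before the global vector.
n_in_frame <- with(hd, length(caseid))
# Evaluated at top level: only the original global vector is visible.
n_global   <- length(caseid)

n_in_frame  # 24
n_global    # 25
```

When the data frame is actually supplied, its columns win; the global vectors are used only when no data frame is in play, which matches the 25 points seen when `data = ` was left out.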
+4  A: 

I would approach this by making a separate environment in which to store all the junk variables, making your data frame using with(), then copying the ones you want to keep into the main environment. This has the advantage of being tidy, but also keeping all your objects around in case you want to look at them again.

    temp <- new.env()
    with(temp, {
        x <- rnorm(25, mean = 65, sd = 10)
        y <- rnorm(25, mean = 75, sd = 7)
        z <- 1:25
        dd <- data.frame(mscore = x, vscore = y, caseid = z)
    })

    dd <- with(temp, dd)

This gives you:

    > ls()
    [1] "dd"   "temp"
    > with(temp, ls())
    [1] "dd" "x"  "y"  "z"

and of course you can get rid of the junk environment if you really want to.

Fojtasek
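A minimal sketch of the final step, discarding the scratch environment once the data frame has been copied out. It uses `temp$dd` as an alternative spelling of `with(temp, dd)`, and a shortened data frame for brevity; both are choices made for this example only.

```r
temp <- new.env()
with(temp, {
  z  <- 1:25
  dd <- data.frame(caseid = z)
})

dd <- temp$dd   # copy the data frame out of the scratch environment
rm(temp)        # drop the environment and the intermediate vectors with it

nrow(dd)        # dd is still available after temp is gone
```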
Or use `local`, like `dd <- local({x<-....; data.frame(mscore=x,...)})`, and there is no `temp`. `local` returns the value of its last expression, so the last line should evaluate to the data frame you want to keep.
Marek
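Spelled out in full, the `local` suggestion might look like the sketch below, which reuses the variable names from the question. The intermediate vectors exist only inside the `local()` block.

```r
dd <- local({
  x <- rnorm(25, mean = 65, sd = 10)
  y <- rnorm(25, mean = 75, sd = 7)
  z <- 1:25
  # The last expression is the value of local(), so it becomes dd.
  data.frame(mscore = x, vscore = y, caseid = z)
})

names(dd)
```

Because `x`, `y`, and `z` are created in a temporary environment, nothing needs to be removed afterwards.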