In order to share some more tips and tricks for R: what is your single most useful feature or trick? Clever vectorization? Data input/output? Visualization and graphics? Statistical analysis? Special functions? The interactive environment itself?

One item per post, and we will see if we get a winner by means of votes.

[Edit 25-Aug 2008]: So after one week, it seems that the simple str() won the poll. As I like to recommend that one myself, it is an easy answer to accept.

+10  A: 

One nice feature: reading data uses connections, which can be local files, remote files accessed via HTTP, pipes from other programs, and more.

As a simple example, consider this query for N=10 random integers between min=100 and max=200 from random.org (which supplies true random numbers based on atmospheric noise rather than a pseudo-random number generator):

R> site <- "http://random.org/integers/"         # base URL
R> query <- "num=10&min=100&max=200&col=2&base=10&format=plain&rnd=new"
R> txt <- paste(site, query, sep="?")            # concat url and query string
R> nums <- read.table(file=txt)                  # and read the data
R> nums                                          # and show it
   V1  V2
1 165 143
2 107 118
3 103 132
4 191 100
5 138 185
R>

As an aside, the random package provides several convenience functions for accessing random.org.
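For instance (a minimal sketch, assuming the random package is installed and using its randomNumbers() helper):

R> library(random)
R> randomNumbers(n=10, min=100, max=200, col=2)   # same query as above, in one call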

Dirk Eddelbuettel
BTW-- I'd suggest that you *should* make self-answers CW if (1) you post them promptly and (2) you don't make the question CW. Otherwise it looks a bit like you're trying to game the rep system. YMMV and all that.
dmckee
It's not gaming the system, just getting things started. He's still free to accept any other answer.
ars
@ars: He's free to accept this one. Nor am I going to attempt to force him to wiki it if he won't take my advice. But I won't post a prepared self-answer without marking it wiki, and I won't vote for one without it either. Take that for what it's worth.
dmckee
good points all. Some self answers are good for discussion. We do this in real life too.
JD Long
@Dirk: it is wholly acceptable, even encouraged by Jeff and Joel, to answer your own question. There is NO requirement, not even an informal one, to make your answer CW. You're clearly not gaming the system. Once again, just ignore the community wiki police.
Juliet
I have to agree that part of the site's purpose is to provide best answers for common problems and a general resource. Posing a question and providing a good answer can help bolster a topic. This is especially useful with new/small tags such as R.
kpierce8
+6  A: 

Upon Dirk's advice, I am posting single examples. I hope they are not too "cute" [clever, but I don't care] or trivial for this audience.

Linear models are the bread and butter of R. When the number of independent variables is high, one has two choices. The first is to use lm.fit(), which receives the design matrix x and the response y as arguments, similarly to Matlab. The drawback to this approach is that the return value is a list of objects (fitted coefficients, residuals, etc.), not an object of class "lm", which can be nicely summarized, used for prediction, stepwise selection, etc. The second approach is to create a formula:

> A
           X1         X2          X3         X4         y
1  0.96852363 0.33827107 0.261332257 0.62817021 1.6425326
2  0.08012755 0.69159828 0.087994158 0.93780481 0.9801304
3  0.10167545 0.38119304 0.865209832 0.16501662 0.4830873
4  0.06699458 0.41756415 0.258071616 0.34027775 0.7508766
   ...

> (f=paste("y ~",paste(names(A)[1:4],collapse=" + ")))
[1] "y ~ X1 + X2 + X3 + X4"

> lm(formula(f),data=A)

Call:
lm(formula = formula(f), data = A)

Coefficients:
(Intercept)           X1           X2           X3           X4  
    0.78236      0.95406     -0.06738     -0.43686     -0.06644
gappy
How about if you pick one per post and illustrate with an example? We can then keep going for days on end and post new examples with new commands... [ BTW: As I recall, you need as.formula(paste(...)) for formula use. ]
Dirk Eddelbuettel
You do not need the explicit formula creation to cover all columns, as the form "y ~ . - 1" covers it. The "." means 'all columns except the dependent variable', and the '- 1' excludes the constant, as in your example.
Dirk Eddelbuettel
That's right for this specific example, but for X with ncols>>nrows, I often remove some independent variables, especially in the final stages of the analysis. In this case, creating a formula from the data frame names is still handy.
gappy
+27  A: 

str() tells you the structure of any object.
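For example, on the built-in iris data frame:

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...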

hadley
Nice, but I wish they had used a different name.
Python uses `dir()` - makes more sense.
Hamish Grubijan
How does that make more sense? `str` is short for structure. Normally `dir` is short for directory.
hadley
Ah, `str` is also short for `string` in many languages.
Hamish Grubijan
+2  A: 

Another trick. Some packages, like glmnet, only take as inputs the design matrix and the response variable. If one wants to fit a model with all interactions between features, she can't use the formula "y ~ .^2". Using expand.grid() allows us to take advantage of the powerful array indexing and vector operations of R.

interArray = function(X) {
    n = ncol(X)
    ind = expand.grid(1:n, 1:n)             # every (i, j) pair of column indices
    return(X[, ind[, 1]] * X[, ind[, 2]])   # element-wise products of all column pairs
}

> X
          X1         X2
1 0.96852363 0.33827107
2 0.08012755 0.69159828
3 0.10167545 0.38119304
4 0.06699458 0.41756415
5 0.08187816 0.09805104

> interArray(X)
           X1          X2        X1.1        X2.1
1 0.938038022 0.327623524 0.327623524 0.114427316
2 0.006420424 0.055416073 0.055416073 0.478308177
3 0.010337897 0.038757974 0.038757974 0.145308137
4 0.004488274 0.027974536 0.027974536 0.174359821
5 0.006704033 0.008028239 0.008028239 0.009614007
gappy
If a modelling function doesn't accept a formula (which is very rare!) wouldn't it be better to construct the design matrix with `model.matrix`?
hadley
Nice one. I didn't know of the existence of this function. The function above is equivalent to model.matrix(~.^2 - 1, X). But regarding passing matrices: aside from glmnet, it is frequent for me to pass array pointers to custom C functions. Indeed, I wouldn't know how to pass a formula to a function. Do you have a toy example?
gappy
+4  A: 

Here is an annoying workaround to convert a factor into a numeric vector. (Similar tricks work for other data types as well.)

old.var <- as.numeric(levels(old.var))[as.numeric(old.var)]
Ryan Rosario
Maybe you meant "into a character" vector. In which case "as.character(old.var)" is simpler.
Dirk Eddelbuettel
I've always thought this advice (which can be read at ?factor) to be misguided. You have to be sure old.var is a factor, and this will vary according to the options you set for the R session. Using as.numeric(as.character(old.var)) is both safer and cleaner.
Eduardo Leoni
Really not worth a downvote, but whatever. This works for me.
Ryan Rosario
Ryan - Could you fix your code? If old.var <- factor(1:2); your code will give [1] "1" "2" (not numeric.) perhaps you meant as.numeric(levels(old.var)[old.var])?
Eduardo Leoni
Or slightly more efficiently: `as.numeric(levels(old.var))[old.var]`
hadley
What about: as.numeric(as.character(old.var)) ?
andrewj
Thanks for editing. But you don't need the as.numeric inside the index. The reason: typeof(factor(1:3)) is integer. Damn you factors!
Eduardo Leoni
+15  A: 

Don't know how well known this is/isn't, but something that I've definitely taken advantage of is the pass-by-reference capability of environments.

zz <- new.env()
zz$foo <- c(1,2,3,4,5)
changer <- function(blah) {
   blah$foo <- 5    # environments are passed by reference, so this modifies zz itself
}
changer(zz)
zz$foo              # now 5

For this toy example it isn't obvious why that would be useful, but if you're passing large objects around it can help.
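By contrast, an ordinary object is effectively copied when modified inside a function, so the caller's copy is untouched (a minimal illustration):

ll <- list(foo = c(1,2,3,4,5))
changer2 <- function(blah) {
   blah$foo <- 5    # modifies a local copy only
}
changer2(ll)
ll$foo              # still 1 2 3 4 5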

geoffjentry
+11  A: 
?ave

Subsets of 'x[]' are averaged, where each subset consists of those observations with the same factor levels. Usage: ave(x, ..., FUN = mean)

I use it all the time. (e.g. in this answer here at SO)
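A minimal illustration with made-up data:

> df <- data.frame(g = c("a","a","b","b"), x = c(1,3,5,9))
> ave(df$x, df$g)   # each value replaced by its group mean
[1] 2 2 7 7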

Eduardo Leoni
when you used that on my 'mixed merge' question it was the first time I had seen it. I'm really glad you showed me this.
JD Long
Totally agree, ave() is very useful.
andrewj
+2  A: 

One of my favorite, if somewhat unorthodox, tricks is the use of eval() and parse(). This example perhaps illustrates how it can be helpful:

NY.Capital <- 'Albany'
state <- 'NY'
parameter <- 'Capital'
eval(parse(text=paste(state, parameter, sep='.')))

[1] "Albany"

This type of situation comes up fairly often, and use of eval() and parse() can help address it. Of course, I welcome any feedback on alternative ways of coding this up.

andrewj
This can be done as well with named vector elements.
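For instance, a sketch of that approach:

capitals <- c(NY.Capital = 'Albany', CA.Capital = 'Sacramento')
capitals[paste(state, parameter, sep='.')]   # "Albany", no parse() needed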
Dirk Eddelbuettel
library(fortunes); fortune(106): "If the answer is parse() you should usually rethink the question." -- Thomas Lumley, R-help (February 2005)
Eduardo Leoni
Here's an example where eval() and parse() can be useful. This involves a Bioconductor package, e.g. hgu133a.db, where you are trying to obtain various pieces of information about a probeset id. For example:

library(hgu133a.db)
parameter <- 'SYMBOL'
mget('202431_s_at', env=eval(parse(text=paste('hgu133a', parameter, sep=''))))
parameter <- 'ENTREZID'
mget('202431_s_at', env=eval(parse(text=paste('hgu133a', parameter, sep=''))))
andrewj
As Dirk says, this is better done with named vector elements, or ` get(paste(state, parameter, sep='.'))`
hadley
@Hadley, didn't know that you could use get() that way. Thanks.
andrewj
+5  A: 

A way to speed up code and eliminate for loops.

Instead of writing for loops that walk through a data frame looking for values, just take a subset of the data frame with those values; it's much quicker.

so instead of:

for (i in 1:nrow(df)) {
  if (df$column1[i] == x) {
    df$column2[i] <- y   # or any other similar assignment
  }
}

do something like this:

df$column2[df$column1 == x] <- y

That basic concept is applicable extremely often and is a great way to get rid of for loops.

Dan
There is a small trap here that used to catch me up all the time. If df$column1 contains NA values, subsetting using == will pull out any values that equal x *and* any NAs. To avoid this, use "%in%" instead of "==".
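A quick illustration of the trap:

x <- c(1, NA, 2)
x == 1     # TRUE NA FALSE -- the missing value yields NA, not FALSE
x %in% 1   # TRUE FALSE FALSE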
Matt Parker
Matt, you're absolutely right, and it's something that I hate. I like your method, though. I usually check the column for NAs and then remove them with a quick function I made that takes a dataframe column and returns the dataframe minus rows with NAs in just that column.
Dan
You mean `na.omit`? ;)
hadley
Essentially, I pare a dataframe down to the columns I need to have values, then use na.omit to get the correct rows, and then subset the original dataset with only those rows. Just using na.omit would remove any row with any NA; I could be mistaken though.
Dan
+2  A: 

A trick to perform an operation on a number of variables in a data frame. This is stolen from subset.data.frame.

get.vars <- function(vars, data) {
    nl <- as.list(1L:ncol(data))
    names(nl) <- names(data)                 # map column names to their positions
    vars <- eval(substitute(vars), nl, parent.frame())
    data[, vars]
    # do stuff here
}

get.vars(c(cyl:hwy, class), mpg)   # mpg is the fuel economy data from ggplot2
Ian Fellows
This seems cool at first, but this sort of code will cause you no end of trouble in the long run. It's always better to be explicit.
hadley
hum, I've been using this trick quite a bit as of late. Could you be more specific about its unbounded trouble?
Ian Fellows
Maybe hadley is suggesting using the plyr package instead?
Christopher DuBois
No, this isn't a veiled suggestion to use plyr instead. The basic problem with your code is that it is semantically lazy - instead of making the user explicitly spell out what they want, you do some "magic" to guess. The problem with this is that it makes the function very hard to program with - i.e. it's difficult to write a function that calls `get.vars` without jumping through a whole lot of hoops.
hadley
+20  A: 

head() and tail() to get the first and last parts of a dataframe, vector, matrix, function, etc. Especially with large data frames, this is a quick way to check that the data has loaded OK.
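For example:

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa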

Rob Hyndman
+1  A: 

I've posted this once before, but I use it so much I thought I'd post it again. It's just a little function to return the names and position numbers of a data.frame. It's nothing special, to be sure, but I almost never make it through a session without using it multiple times.

##creates an object from a data.frame listing the column names and location

namesind <- function(df) {
  temp1 <- names(df)
  temp2 <- seq(1, length(temp1))
  temp3 <- data.frame(temp1, temp2)
  names(temp3) <- c("VAR", "COL")
  return(temp3)
}

ni <- namesind

kpierce8
This is really a one-liner: `data.frame(VAR = names(df), COL = seq_along(df))`
hadley
very elegant, maybe I'll switch it to ni <- function(df){data.frame(VAR = names(df), COL = seq_along(df))}
kpierce8
I use: data.frame(colnames(the.df))
Tal Galili
+3  A: 

Definitely system(). Being able to access all the Unix tools (at least under Linux/MacOSX) from inside the R environment has rapidly become invaluable in my daily workflow.
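For example (a small sketch; intern=TRUE captures the command's output as a character vector):

csvs <- system("ls *.csv", intern = TRUE)   # csv files in the working directory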

Paolo
That ties into my earlier comment about connections: you can also use pipe() to pass data from, or to, Unix commands. See `help(connections)` for details and examples.
Dirk Eddelbuettel
Thanks, very useful!
Paolo
Thanks! Brilliant.
Thrawn
+9  A: 

Use backticks to reference non standard names.

> df <- data.frame(x=rnorm(5),y=runif(5))
> names(df) <- 1:2
> df
           1         2
1 -1.2035003 0.6989573
2 -1.2146266 0.8272276
3  0.3563335 0.0947696
4 -0.4372646 0.9765767
5 -0.9952423 0.6477714
> df$1
Error: unexpected numeric constant in "df$1"
> df$`1`
[1] -1.2035003 -1.2146266  0.3563335 -0.4372646 -0.9952423

In this case, df[,"1"] would also work. But backticks work inside formulas!

> lm(`2`~`1`,data=df)

Call:
lm(formula = `2` ~ `1`, data = df)

Coefficients:
(Intercept)          `1`  
     0.4087      -0.3440

[Edit] Dirk asks why one would give invalid names? I don't know! But I certainly encounter this problem in practice fairly often. For example, using hadley's reshape package:

> library(reshape)
> df$z <- c(1,1,2,2,2)
> recast(df,z~.,id.var="z")
Aggregation requires fun.aggregate: length used as default
  z (all)
1 1     4
2 2     6
> recast(df,z~.,id.var="z")$(all)
Error: unexpected '(' in "recast(df,z~.,id.var="z")$("
> recast(df,z~.,id.var="z")$`(all)`
Aggregation requires fun.aggregate: length used as default
[1] 4 6
Eduardo Leoni
Ok, but _why_ would you need to replace syntactically valid names (like x or y) with invalid ones (like 1 or 2) requiring the backticks?
Dirk Eddelbuettel
It's also useful in `read.table` when `check.names` is false - i.e. when you want to work with the original column names.
hadley
+1  A: 

set.seed() sets the random number generator state.

For example:

> set.seed(123)
> rnorm(1)
[1] -0.5604756
> rnorm(1)
[1] -0.2301775
> set.seed(123)
> rnorm(1)
[1] -0.5604756
Christopher DuBois
super useful with examples that use random functions... helps get everyone on the same page
JD Long
+5  A: 

CrossTable() from the gmodels package provides easy access to SAS- and SPSS-style crosstabs, along with the usual tests (Chisq, McNemar, etc.). Basically, it's xtabs() with fancy output and some additional tests - but it does make sharing output with the heathens easier.
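A minimal sketch (assuming gmodels is installed; infert is a built-in data set):

library(gmodels)
CrossTable(infert$education, infert$case, chisq = TRUE)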

Matt Parker
Nice!! I use gmodels quite a bit, but missed that one
Abhijit
Good answer, anything that can keep me away from excessive explanation of tables with the heathens is a good use of time.
Stedy
+9  A: 

Data Input trick = RGoogleDocs package

http://www.omegahat.org/RGoogleDocs/

I have found Google spreadsheets to be a fantastic way for all collaborators to be on the same page. Furthermore, Google Forms allows one to capture data from respondents and effortlessly write it to a Google spreadsheet. Since data changes frequently and is almost never final, it is far preferable for R to read a Google spreadsheet directly than to futz with downloading csv files and reading them in.

# Get data from google spreadsheet
library(RGoogleDocs)
ps <- readline(prompt="get the password in ")
auth <- getGoogleAuth("[email protected]", ps, service="wise")
sheets.con <- getGoogleDocsConnection(auth)
ts2 <- getWorksheets("Data Collection Repos", sheets.con)
names(ts2)
init.consent <- sheetAsMatrix(ts2$Sheet1, header=TRUE, as.data.frame=TRUE, trim=TRUE)

I cannot remember which, but one or two of the following commands take several seconds:

  1. getGoogleAuth

  2. getGoogleDocsConnection

  3. getWorksheets

Farrel
+1  A: 

It seems I cannot comment (maybe it has to do with this "reputation" business)

Anyway, further to the RGoogleDocs tips above:

ps <-readline(prompt="get the password in ")

This won't work from within Emacs, which I like to use for R, with ESS of course.

On Linux, you can use zenity to get the password from user input, and set it to hide the input; as an additional benefit, your password is not shown in plain text on your screen:

mypass <- system("zenity --entry --hide-text",intern=TRUE)

Marianne
+3  A: 

I'm really surprised no one has posted about apply, tapply, lapply, and sapply. A general rule I use when doing stuff in R is that if I have a for loop that is doing data processing or simulations, I try to factor it out and replace it with an *apply. Some people shy away from the *apply functions because they think only single-parameter functions can be passed in. Nothing could be further from the truth! Just as you pass around functions with parameters as first-class objects in JavaScript, you do this in R with anonymous functions. For example:

 > sapply(rnorm(100, 0, 1), round)
  [1]  1  1  0  1  1 -1 -2  0  2  2 -2 -1  0  1 -1  0  1 -1  0 -1  0  0  0  0  0
 [26]  2  0 -1 -2  0  0  1 -1  1  5  1 -1  0  1  1  1  2  0 -1  1 -1  1  0 -1  1
 [51]  2  1  1 -2 -1  0 -1  2 -1  1 -1  1 -1  0 -1 -2  1  1  0 -1 -1  1  1  2  0
 [76]  0  0  0 -2 -1  1  1 -2  1 -1  1  1  1  0  0  0 -1 -3  0 -1  0  0  0  1  1


> sapply(rnorm(100, 0, 1), round(x, 2)) # How can we pass a parameter?
Error in match.fun(FUN) : object 'x' not found


# Wrap your function call in an anonymous function to use parameters
> sapply(rnorm(100, 0, 1), function(x) {round(x, 2)})
  [1] -0.05 -1.74 -0.09 -1.23  0.69 -1.43  0.76  0.55  0.96 -0.47 -0.81 -0.47
 [13]  0.27  0.32  0.47 -1.28 -1.44 -1.93  0.51 -0.82 -0.06 -1.41  1.23 -0.26
 [25]  0.22 -0.04 -2.17  0.60 -0.10 -0.92  0.13  2.62  1.03 -1.33 -1.73 -0.08
 [37]  0.45 -0.93  0.40  0.05  1.09 -1.23 -0.35  0.62  0.01 -1.08  1.70 -1.27
 [49]  0.55  0.60 -1.46  1.08 -1.88 -0.15  0.21  0.06  0.53 -1.16 -2.13 -0.03
 [61]  0.33 -1.07  0.98  0.62 -0.01 -0.53 -1.17 -0.28 -0.95  0.71 -0.58 -0.03
 [73] -1.47 -0.75 -0.54  0.42 -1.63  0.05 -1.90  0.40 -0.01  0.14 -1.58  1.37
 [85] -1.00 -0.90  1.69 -0.11 -2.19 -0.74  1.34 -0.75 -0.51 -0.99 -0.36 -1.63
 [97] -0.98  0.61  1.01  0.55

# Note that anonymous functions aren't being called, but being passed.
> function() {print('hello #rstats')}()
function() {print('hello #rstats')}()
> a = function() {print('hello #rstats')}
> a
function() {print('hello #rstats')}
> a()
[1] "hello #rstats"

(For those that follow #rstats, I also posted this there).

Remember, use apply, sapply, lapply, tapply, and do.call! Take advantage of R's vectorization. You should never walk up to a bunch of R code and see:

N = 10000
l = numeric()
for (i in seq(1:N)) {
    sim <- rnorm(1, 0, 1)
    l <- rbind(l, sim)
}

Not only is this not vectorized, but the array structure in R is not grown as it is in Python (doubling in size when space runs out, IIRC). So each rbind step must allocate enough space for the combined result and then copy over all of the previous l's contents. For fun, try the above in R. Notice how long it takes (you won't even need Rprof or any timing function). Then try

N=10000
l <- rnorm(N, 0, 1)

The following is better than the first version too:

N = 10000
l = numeric(N)
for (i in seq(1:N)) {
    sim <- rnorm(1, 0, 1)
    l[i] <- sim
}
Vince
apply, sapply, lapply and tapply are useful. If you want to pass parameters to a named function like round, you can just pass it along with apply instead of writing an anonymous function. Try "sapply(rnorm(10, 0, 1), round, digits=2)" which outputs "[1] -0.29 0.29 1.31 -0.06 -1.90 -0.84 0.21 0.02 0.23 -1.10".
Daniel
+8  A: 

Sometimes you need to rbind multiple data frames. do.call() will let you do that (someone had to explain this to me when I asked this question, as it doesn't appear to be an obvious use).

foo <- list()

foo[[1]] <- data.frame(a=1:5, b=11:15)
foo[[2]] <- data.frame(a=101:105, b=111:115)
foo[[3]] <- data.frame(a=200:210, b=300:310)

do.call(rbind, foo)
andrewj
Good call: I find that this is often simpler than using `unsplit`.
Richie Cotton
+8  A: 

My new favorite thing is the foreach library. It lets you do all of the nice apply things, but with a somewhat easier syntax:

library(foreach)
x <- runif(100)   # example input vector (assumed; not defined in the original post)
list_powers <- foreach(i = 1:100) %do% {
  lp <- x[i]^i
  return(lp)
}

The best part is that if you are doing something that actually requires a significant amount of time, you can switch from %do% to %dopar% (with the appropriate backend library) to instantly parallelize, even across a cluster. Very slick.
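A sketch of the parallel version, assuming the doParallel package as the backend (other backends work too):

library(doParallel)
registerDoParallel(cores = 2)               # register a parallel backend
list_powers <- foreach(i = 1:100) %dopar% {
  x[i]^i                                    # same body, now run in parallel
}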

JAShapiro
+3  A: 

R just loves to create factors and then trips over itself when factor levels are missing, so my best trick is the following function to eliminate factor levels that do not appear in the data.

drop.levels <- function(dat) {
  if (is.factor(dat))
    dat <- dat[, drop = TRUE]
  else
    dat[] <- lapply(dat, function(x) x[, drop = TRUE])
  return(dat)
}
kevin
I'm stealing this one! :)
Roman Luštrik
+1  A: 

As a recent R addict, I love ?function_name and use it all the time

-k

knguyen
@Dirk: Just a side note. You are everywhere :) !
knguyen
That's just a lack of focus :) In Emacs/ESS, ?? and ??? also work but I use them way less.
Dirk Eddelbuettel
+4  A: 

You can assign a value returned from an if-else block.

Instead of, e.g.

condition <- runif(1) > 0.5
if(condition) x <- 1 else x <- 2

you can do

x <- if(condition) 1 else 2

Exactly how this works is deep magic.

Richie Cotton
You could also do this like x <- ifelse(condition, 1, 2), in which case each component is vectorized.
Shane
Shane, you could, but unless you really deeply grok what ifelse() does, you probably shouldn't! It's an easy one to misunderstand...
Harlan
+6  A: 

I do a lot of basic manipulation of data, so here are two built-in functions (transform, subset) and one library (sqldf) that I use daily.

create sample sales data

sales <- expand.grid(country = c('USA', 'UK', 'FR'),
                     product = c(1, 2, 3))
sales$revenue <- rnorm(dim(sales)[1], mean=100, sd=10)

> sales
  country product   revenue
1     USA       1 108.45965
2      UK       1  97.07981
3      FR       1  99.66225
4     USA       2 100.34754
5      UK       2  87.12262
6      FR       2 112.86084
7     USA       3  95.87880
8      UK       3  96.43581
9      FR       3  94.59259

use transform() to add a column

## transform currency to euros
usd2eur <- 1.434
transform(sales, euro = revenue * usd2eur)

>
  country product   revenue     euro
1     USA       1 108.45965 155.5311
2      UK       1  97.07981 139.2125
3      FR       1  99.66225 142.9157
...

use subset() to slice the data

subset(sales, 
       country == 'USA' & product %in% c(1, 2), 
       select = c('product', 'revenue'))

>
  product  revenue
1       1 108.4597
4       2 100.3475

use sqldf() to slice and aggregate with SQL

The sqldf package provides an SQL interface to R data frames

##  recast the previous subset() expression in SQL
sqldf('SELECT product, revenue FROM sales
       WHERE country = "USA"
       AND product IN (1,2)')

>
  product  revenue
1       1 108.4597
2       2 100.3475

Perform an aggregation or GROUP BY

sqldf('SELECT country, SUM(revenue) AS revenue
       FROM sales
       GROUP BY country')

>
  country  revenue
1      FR 307.1157
2      UK 280.6382
3     USA 304.6860

For more sophisticated map-reduce-like functionality on data frames, check out the plyr package. And if you find yourself wanting to pull your hair out, I recommend checking out Data Manipulation with R.
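For instance, the GROUP BY aggregation above could be written with plyr as (a sketch, assuming the package is installed):

library(plyr)
ddply(sales, .(country), summarise, revenue = sum(revenue))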

dataspora
+18  A: 

One very useful function I often use is dput(), which allows you to dump an object in the form of R code.

# Use the iris data set
R> data(iris)
# dput of a numeric vector
R> dput(iris$Petal.Length)
c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 
1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1, 1.7, 1.9, 
1.6, 1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 1.5, 1.4, 1.5, 1.2, 1.3, 1.4, 
1.3, 1.5, 1.3, 1.3, 1.3, 1.6, 1.9, 1.4, 1.6, 1.4, 1.5, 1.4, 4.7, 
4.5, 4.9, 4, 4.6, 4.5, 4.7, 3.3, 4.6, 3.9, 3.5, 4.2, 4, 4.7, 
3.6, 4.4, 4.5, 4.1, 4.5, 3.9, 4.8, 4, 4.9, 4.7, 4.3, 4.4, 4.8, 
5, 4.5, 3.5, 3.8, 3.7, 3.9, 5.1, 4.5, 4.5, 4.7, 4.4, 4.1, 4, 
4.4, 4.6, 4, 3.3, 4.2, 4.2, 4.2, 4.3, 3, 4.1, 6, 5.1, 5.9, 5.6, 
5.8, 6.6, 4.5, 6.3, 5.8, 6.1, 5.1, 5.3, 5.5, 5, 5.1, 5.3, 5.5, 
6.7, 6.9, 5, 5.7, 4.9, 6.7, 4.9, 5.7, 6, 4.8, 4.9, 5.6, 5.8, 
6.1, 6.4, 5.6, 5.1, 5.6, 6.1, 5.6, 5.5, 4.8, 5.4, 5.6, 5.1, 5.1, 
5.9, 5.7, 5.2, 5, 5.2, 5.4, 5.1)
# dput of a factor's levels
R> dput(levels(iris$Species))
c("setosa", "versicolor", "virginica")

It can be very useful to post easily reproducible data chunks when you ask for help, or to edit or reorder the levels of a factor.

juba
very VERY cool - thanks!
Tal Galili
+2  A: 

Although this question has been up for a while, I recently discovered a great trick on the SAS and R blog for using the cut command. The command is used to divide data into categories; as an example, I will use the iris dataset and divide the sepal lengths into 10 categories:

> irisSL <- iris$Sepal.Length
> str(irisSL)
 num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
> cut(irisSL, 10)
  [1] (5.02,5.38] (4.66,5.02] (4.66,5.02] (4.3,4.66]  (4.66,5.02] (5.38,5.74] (4.3,4.66]  (4.66,5.02] (4.3,4.66]  (4.66,5.02]
 [11] (5.38,5.74] (4.66,5.02] (4.66,5.02] (4.3,4.66]  (5.74,6.1]  (5.38,5.74] (5.38,5.74] (5.02,5.38] (5.38,5.74] (5.02,5.38]
 [21] (5.38,5.74] (5.02,5.38] (4.3,4.66]  (5.02,5.38] (4.66,5.02] (4.66,5.02] (4.66,5.02] (5.02,5.38] (5.02,5.38] (4.66,5.02]
 [31] (4.66,5.02] (5.38,5.74] (5.02,5.38] (5.38,5.74] (4.66,5.02] (4.66,5.02] (5.38,5.74] (4.66,5.02] (4.3,4.66]  (5.02,5.38]
 [41] (4.66,5.02] (4.3,4.66]  (4.3,4.66]  (4.66,5.02] (5.02,5.38] (4.66,5.02] (5.02,5.38] (4.3,4.66]  (5.02,5.38] (4.66,5.02]
 [51] (6.82,7.18] (6.1,6.46]  (6.82,7.18] (5.38,5.74] (6.46,6.82] (5.38,5.74] (6.1,6.46]  (4.66,5.02] (6.46,6.82] (5.02,5.38]
 [61] (4.66,5.02] (5.74,6.1]  (5.74,6.1]  (5.74,6.1]  (5.38,5.74] (6.46,6.82] (5.38,5.74] (5.74,6.1]  (6.1,6.46]  (5.38,5.74]
 [71] (5.74,6.1]  (5.74,6.1]  (6.1,6.46]  (5.74,6.1]  (6.1,6.46]  (6.46,6.82] (6.46,6.82] (6.46,6.82] (5.74,6.1]  (5.38,5.74]
 [81] (5.38,5.74] (5.38,5.74] (5.74,6.1]  (5.74,6.1]  (5.38,5.74] (5.74,6.1]  (6.46,6.82] (6.1,6.46]  (5.38,5.74] (5.38,5.74]
 [91] (5.38,5.74] (5.74,6.1]  (5.74,6.1]  (4.66,5.02] (5.38,5.74] (5.38,5.74] (5.38,5.74] (6.1,6.46]  (5.02,5.38] (5.38,5.74]
[101] (6.1,6.46]  (5.74,6.1]  (6.82,7.18] (6.1,6.46]  (6.46,6.82] (7.54,7.9]  (4.66,5.02] (7.18,7.54] (6.46,6.82] (7.18,7.54]
[111] (6.46,6.82] (6.1,6.46]  (6.46,6.82] (5.38,5.74] (5.74,6.1]  (6.1,6.46]  (6.46,6.82] (7.54,7.9]  (7.54,7.9]  (5.74,6.1] 
[121] (6.82,7.18] (5.38,5.74] (7.54,7.9]  (6.1,6.46]  (6.46,6.82] (7.18,7.54] (6.1,6.46]  (5.74,6.1]  (6.1,6.46]  (7.18,7.54]
[131] (7.18,7.54] (7.54,7.9]  (6.1,6.46]  (6.1,6.46]  (5.74,6.1]  (7.54,7.9]  (6.1,6.46]  (6.1,6.46]  (5.74,6.1]  (6.82,7.18]
[141] (6.46,6.82] (6.82,7.18] (5.74,6.1]  (6.46,6.82] (6.46,6.82] (6.46,6.82] (6.1,6.46]  (6.46,6.82] (6.1,6.46]  (5.74,6.1] 
10 Levels: (4.3,4.66] (4.66,5.02] (5.02,5.38] (5.38,5.74] (5.74,6.1] (6.1,6.46] (6.46,6.82] (6.82,7.18] ... (7.54,7.9]
Stedy
+4  A: 

The traceback() function is a must when you have an error somewhere and do not readily understand it. It will print a trace of the call stack, which is very helpful, as R is not very verbose by default.

Then, setting options(error=recover) will allow you to "enter" the function raising the error and try to understand what happens exactly, as if you had full control over it and could put a browser() in it.

These three functions can really help debugging your code.
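A minimal sketch of the workflow, using a hypothetical failing function:

f <- function(x) x$a          # fails when x is an atomic vector
f(1:3)                        # Error in x$a : $ operator is invalid for atomic vectors
traceback()                   # prints the call stack at the time of the error
options(error = recover)      # the next error will offer an interactive browser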

Calimo
`options(error=recover)` is my favorite debugging method.
Joshua Ulrich
+7  A: 

In R programming (not interactive sessions), I use if (bad.condition) stop("message") a lot. Every function starts with a few of these, and as I work through computations, I pepper these in, too. I guess I got into the habit from using assert() in C. The benefits are two-fold. First, it's a lot faster to get working code with these checks in place. Second, and probably more important, it is a lot easier to work with existing code when you see these checks on every screen in your editor. You won't have to wonder whether x>0, or trust a comment stating that it is ... you'll know, from a glance, that it is.
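For instance (a hypothetical function written in this style):

geo.mean <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  if (any(x <= 0)) stop("x must be positive")
  exp(mean(log(x)))
}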

PS. my first post here. Be gentle!

dan
Not a bad habit, and R offers yet another way: `stopifnot(!bad.condition)`, which is more concise.
Dirk Eddelbuettel
First post = necromancer!
Joshua
+4  A: 

As a total noob to R and a novice at stats, I love unclass() for printing all elements of a data frame as an ordinary list.

It's pretty handy for a look at a complete data set all in one go to quickly eyeball any potential issues.
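For example:

> df <- data.frame(x = 1:2, y = c(10, 20))
> unclass(df)
$x
[1] 1 2

$y
[1] 10 20

attr(,"row.names")
[1] 1 2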

John
What?! A new R person who actually answered something instead of just asking a basic question and then disappearing?! I don't believe it.
Matt Parker
Er... by which I mean: welcome!
Matt Parker
Thanks, don't worry though, there will be more questions than answers.
John
+1  A: 

For those who are writing C to be called from R: .Internal(inspect(...)) is handy. For example:

> .Internal(inspect(quote(a+2)))
  @867dc28 06 LANGSXP g0c0 [] 
  @8436998 01 SYMSXP g1c0 [MARK,gp=0x4000] "+"
  @85768b0 01 SYMSXP g1c0 [MARK,NAM(2)] "a"
  @8d7bf48 14 REALSXP g0c1 [] (len=1, tl=0) 2
Joshua Ulrich