Views: 680 | Answers: 11

This morning I was reading an interview with Bradford Cross of FlightCaster over on the DataWrangling blog. Bradford mentions in the interview that one of his biggest challenges has been the wrangling of their data. Even though I am an economist, I find that I spend more than 70% of my time on reformatting, ETL, cleaning, etc., and much less time doing actual econometrics.

If you were to counsel a relatively new R user on which data wrangling method and commands to learn, which would you tell them to focus on and why?

A: 

Sadly, I do not think that there is a magic bullet or magic wand.

So my recommendation is to learn the language, period, as it is the tool to 'program with data'.

Many CRAN packages are surely of help (e.g. for data input/output as in database connections, for report creation, for domain-specific modeling, ...) but the crux lies in the actual data work.

Dirk Eddelbuettel
It's a totally valid point to say one has to learn the language. But for the beginner it's an issue of maximizing utility from R subject to a time constraint. From what I've been doing the last few days I feel I need to get much better with melt and with Hadley's plyr tools. Learning those two could save me lots of time.
JD Long
+4  A: 

I use indexing all the damn time. Fairly basic, but completely essential.

merge() and subset() are also incredibly useful.

Knowing how to wrangle factors is fairly vital, if you use that sort of data.

Finally, for getting the data into R, the RODBC package is handy - it put my introduction to SQL into a cozy, familiar environment. Learning the parameters of read.table() is useful, too - even though the defaults work well most of the time, a lot of things are easiest to deal with right when you're bringing the data in.
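A minimal sketch of those ideas, using made-up data frames (people, states) and a hypothetical file name:

    # made-up example data
    people <- data.frame(id = 1:4,
                         state = factor(c("WI", "IL", "WI", "MN")),
                         income = c(52000, 61000, NA, 48000))
    states <- data.frame(state = c("WI", "IL", "MN"),
                         region = c("Midwest", "Midwest", "Midwest"))

    people[people$income > 50000 & !is.na(people$income), ]  # logical indexing
    subset(people, state == "WI", select = c(id, income))    # same idea, more readable
    merge(people, states, by = "state", all.x = TRUE)        # left join on state
    levels(people$state)                                      # inspect factor levels
    wi_mn <- subset(people, state != "IL")
    wi_mn$state <- factor(wi_mn$state)                        # drop the unused "IL" level

    # hypothetical file: the import arguments do a lot of the cleaning for you
    dat <- read.table("survey.txt", header = TRUE, sep = "\t",
                      na.strings = c("", "NA", "-999"),
                      stringsAsFactors = FALSE)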

Matt Parker
Update: Sweet Mother of God, I love merge().
Matt Parker
+3  A: 

This is not specific to R, but regular expressions are very handy for ill-formatted data. In R, look at grep/grepl and gsub, and at agrep for approximate matching of strings (not regular expressions).
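A small sketch on made-up strings:

    x <- c("Price: $1,200", "price:$85", "cost 9.99", "Pirce: $42")  # messy input

    grepl("price", x, ignore.case = TRUE)               # logical: mentions "price"?
    grep("price", x, ignore.case = TRUE, value = TRUE)  # the matching strings
    as.numeric(gsub("[^0-9.]", "", x))                  # keep only digits and dots
    agrep("Price", x, max.distance = 2, value = TRUE)   # fuzzy match catches the "Pirce" typo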

Eduardo Leoni
+11  A: 

The reshape package has some nice tools (melt and cast) for changing data from a wide to a long format and vice versa.
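A minimal sketch with a made-up data frame (reshape2's dcast() is the newer equivalent of cast()):

    library(reshape)

    wide <- data.frame(subject = c("s1", "s2"),
                       pre = c(10, 12),
                       post = c(14, 18))

    long <- melt(wide, id.vars = "subject")  # wide -> long: one row per measurement
    cast(long, subject ~ variable)           # long -> wide again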

Thierry
Agreed. I've been meaning to work on my fluency with this package for some time now. Completely changed the way I think about data.
Matt Parker
Melt got my attention when Jonathan Chang used it in his answer here: http://stackoverflow.com/questions/1313954/ I thought to myself, "Self, you gotta learn melt... what other methods like this am I missing?"
JD Long
+1  A: 

Write regular expressions using regexpr, gsub, or grep. Wrap them in a function and use lapply, sapply, and do.call. I can't emphasize this last part (a functional approach) enough; R is a vectorized language and these commands are very fast!
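For instance, a sketch on made-up log lines:

    log_lines <- c("user=alice status=200 time=0.31",
                   "user=bob status=404 time=1.20",
                   "user=carol status=200 time=0.08")

    # wrap the extraction in a small function, then apply it over the vector
    parse_line <- function(x) {
      fields <- strsplit(x, "[[:space:]]+")[[1]]
      vals <- gsub("^[a-z]+=", "", fields)     # the values
      names(vals) <- gsub("=.*$", "", fields)  # named by their keys
      vals
    }

    do.call(rbind, lapply(log_lines, parse_line))  # character matrix, one row per line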

There is also the XML package, which allows extraction with XPath. Highly recommended!
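And a tiny XPath sketch with a made-up inline document:

    library(XML)

    txt <- "<books><book><title>R Inferno</title></book><book><title>MASS</title></book></books>"
    doc <- xmlParse(txt, asText = TRUE)
    xpathSApply(doc, "//book/title", xmlValue)  # "R Inferno" "MASS"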

Hope this helps!

Vince
+4  A: 

Apply and friends: tapply, lapply, sapply, aggregate, by.
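A quick tour on a made-up data frame:

    sales <- data.frame(store = rep(c("A", "B"), each = 3),
                        amount = c(10, 12, 9, 20, 18, 25))

    tapply(sales$amount, sales$store, mean)          # named vector of group means
    aggregate(amount ~ store, data = sales, sum)     # group sums as a data frame
    by(sales, sales$store, summary)                  # any function per group
    sapply(split(sales$amount, sales$store), range)  # simplified to a matrix
    lapply(split(sales$amount, sales$store), diff)   # list in, list out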

hadley
+9  A: 

The plyr package.
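A minimal ddply() sketch on a made-up data frame:

    library(plyr)

    sales <- data.frame(region = rep(c("East", "West"), each = 4),
                        amount = c(10, 12, 9, 14, 20, 18, 25, 22))

    # split by region, summarise each piece, recombine into a data frame
    ddply(sales, .(region), summarise, total = sum(amount), avg = mean(amount))

    # transform keeps one row per original record
    ddply(sales, .(region), transform, share = amount / sum(amount))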

hadley
Naturally! ggplot2 is pretty good too. :)
JD Long
+3  A: 

I have to admit that I feel a little dirty using this one, but I get huge mileage out of the sqldf package, which allows using SQL syntax on a data frame. It has allowed me to leverage my kick ass SQL skills in R land. It also allows me to write some really odd code that's pretty inefficient. But it does what I want and I get analysis out the door faster because I use it. It's a trade-off.

Earlier SO question on sqldf

Google Code page on sqldf

It's pretty dang amusing how it works. My read of the docs (I have not delved into the source code) is that it pushes the data frame into SQLite, executes an SQLite query on the table and then passes back the result. Pretty slick way to get SQL syntax against a data frame... albeit of dubious code efficiency. Definitely from the school of 'code that gets shit done.' Pragmatic and ingenious.
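A minimal sketch on a made-up data frame:

    library(sqldf)

    cities <- data.frame(state = c("California", "California", "Texas", "Texas"),
                         city = c("LA", "SF", "Austin", "Fort Worth"),
                         population = c(3.83e6, 8e5, 7.57e5, 7.03e5))

    # the data frame is copied into a temporary SQLite database, the query runs
    # there, and the result comes back as an ordinary data frame
    sqldf("SELECT state, SUM(population) AS total_pop
             FROM cities
            GROUP BY state
            ORDER BY total_pop DESC")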

JD Long
+1  A: 

Another useful set of functions covers getting data into R: readLines() combined with strsplit(), and read.table() with its very useful colClasses parameter.
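A short sketch, with a hypothetical pipe-delimited file name:

    raw <- readLines("ratings.txt")                     # each line as one string
    pieces <- strsplit(raw, split = "|", fixed = TRUE)  # list of character vectors

    # colClasses skips type guessing: faster on big files, and columns arrive
    # as the types you intend
    dat <- read.table("ratings.txt", sep = "|", header = TRUE,
                      colClasses = c("character", "factor", "numeric", "Date"))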

Jeff
+3  A: 

ave() is a great, underappreciated function that works somewhat like tapply() and lets you compute values by level. For example, take the data frame foo below, where you want to find the city with the highest population in each state. ave() will return a vector as follows.

foo <- data.frame(state = c('California', 'California', 'Texas', 'Texas'),
                  city = c('LA', 'SF', 'Austin', 'Fort Worth'),
                  population = c(3.83e6, 8e5, 7.57e5, 7.03e5))

# within-state rank, largest population first: returns 1 2 1 2
ave(foo$population, foo$state, FUN = function(x) rank(-1 * x))
andrewj
+3  A: 

The person who wrote above that sqldf has dubious efficiency should note that performance is rarely intuitive. Although sqldf is intended for convenience rather than speed, several users have done speed comparisons, and each found that sqldf was actually faster in the cases they benchmarked than performing the manipulations in plain R. Links to their original postings can be found in the December 9, 2009 News item on the sqldf home page.

G. Grothendieck
Welcome to stackoverflow, Gabor!
Shane
Gabor, I'm really glad that you've joined Stack Overflow. Your comments on sqldf() are spot on. I've been using sqldf as my preferred method for pulling big(ish) data into R with great success (http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/). Fast and easy. Just how I like it! ;)
JD Long