views: 3394
answers: 11
Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:

  1. Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.

  2. The analyst downloads some data, munges the data and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).

  3. The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (2).

  4. Rinse repeat until the tables and graphics meet QA/QC and satisfy the client.

  5. Write report incorporating tables and graphics.

  6. Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data by a new download (e.g. get the building permits from the last year), and pressing a "RECALCULATE" button, unless specifications change.

At the moment, I just start a directory and ad-hoc it as best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ArcGIS, R, and Unix tools.

Thanks!

PS:

Below is a basic Makefile that checks for dependencies among various intermediate datasets (".RData" suffix) and scripts (".R" suffix). Make uses timestamps to check dependencies, so if you 'touch ss07por.csv', it will see that this file is newer than all the files/targets that depend on it and will execute the given scripts in order to update them accordingly. This is still a work in progress: I still need to add a step for loading into an SQL database and a step for a templating language like Sweave (a rough sketch of such a rule follows the Makefile). Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!

http://www.gnu.org/software/make/manual/html_node/index.html#Top

R=/home/wsprague/R-2.9.2/bin/R

# Note: the recipe lines below must be indented with a real tab character.

persondata.RData: ImportData.R ../../DATA/ss07por.csv Functions.R
	$(R) --slave -f ImportData.R

persondata.Munged.RData: MungeData.R persondata.RData Functions.R
	$(R) --slave -f MungeData.R

report.txt: TabulateAndGraph.R persondata.Munged.RData Functions.R
	$(R) --slave -f TabulateAndGraph.R > report.txt
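As a hedged sketch only: the missing Sweave step mentioned above could be added as another pair of rules along these lines, where report.Rnw is a hypothetical template name and the recipe lines again need real tabs.

# Hypothetical extension: weave an Sweave template into .tex once the
# munged data is up to date, then compile the .tex to PDF.
report.tex: report.Rnw persondata.Munged.RData Functions.R
	$(R) CMD Sweave report.Rnw

report.pdf: report.tex
	pdflatex report.tex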

+3  A: 

I use Sweave for the report-producing side of this, but I've also been hearing about the brew package - though I haven't yet looked into it.

Essentially, I have a number of surveys for which I produce summary statistics. Same surveys, same reports every time. I built a Sweave template for the reports (which takes a bit of work). But once that work is done, I have a separate R script that lets me point the template at the new data. I press "Go", Sweave dumps out a few score .tex files, and I run a little Python script to pdflatex them all. My predecessor spent ~6 weeks each year on these reports; I spend about 3 days (mostly on cleaning data; escape characters are hazardous).
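A minimal sketch of what such a driver script might look like, assuming a hypothetical template name and data path, and using tools::texi2dvi in place of a separate Python/pdflatex step:

# Hypothetical driver: point the Sweave template at a new data file,
# weave it, and compile the resulting .tex to PDF.
library(tools)

build_report <- function(data_file, template = "survey_report.Rnw") {
  # Make the data location visible to the template, e.g. via an option it reads.
  options(survey.data.file = data_file)
  Sweave(template)
  texi2dvi(sub("\\.Rnw$", ".tex", template), pdf = TRUE)
}

build_report("data/survey_2010.csv")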

It's very possible that there are better approaches now, but if you do decide to go this route, let me know - I've been meaning to put up some of my Sweave hacks, and that would be a good kick in the pants to do so.

Matt Parker
Would love to see some of these "Sweave hacks". It's giving me a headache!
Brandon Bertelsen
+1  A: 

Agreed that Sweave is the way to go, with xtable for generating LaTeX tables. Although I haven't spent too much time working with them, the recently released tikzDevice package looks really promising, particularly when coupled with pgfSweave (which, as far as I know, is only available on rforge.net at this time -- there is a link to R-Forge from there, but it's not responding for me at the moment).

Between the two, you'll get consistent formatting between text and figures (fonts, etc.). With brew, these might constitute the holy grail of report generation.
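As a rough sketch of driving tikzDevice from a plain Sweave document (the approach described in the comment below), with a purely hypothetical figure name -- the chunk writes a TikZ file and then inputs it into the LaTeX document:

<<echo=FALSE,results=tex>>=
library(tikzDevice)
tikz("density-plot.tex", width = 5, height = 3.5)  # hypothetical file name
plot(density(rnorm(1000)), main = "")
dev.off()
cat("\\input{density-plot}\n")
@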

Kevin
pgfSweave is currently in "development limbo" as the developers haven't had time to incorporate the new tikzDevice. For now we suggest using tikzDevice from within normal Sweave documents -- the user just has to take responsibility for opening/closing the device and \input{}-ing the resulting output.
Sharpie
+2  A: 

I'm going to suggest something in a different sort of direction from the other submitters, based on the fact that you asked specifically about project workflow, rather than tools. Assuming you're relatively happy with your document-production model, it sounds like your challenges really may be centered more around issues of version tracking, asset management, and review/publishing process.

If that sounds correct, I would suggest looking into an integrated ticketing/source management/documentation tool like Redmine. Keeping related project artifacts such as pending tasks, discussion threads, and versioned data/code files together can be a great help even for projects well outside the traditional "programming" bailiwick.

rcoder
+8  A: 

I agree with the other responders: Sweave is excellent for report writing with R. And rebuilding the report with updated results is as simple as re-calling the Sweave function. It's completely self-contained, including all the analysis, data, etc. And you can version control the whole file.

I use the StatET plugin for Eclipse for developing the reports, and Sweave is integrated (Eclipse recognizes LaTeX formatting, etc.). On Windows, it's easy to use MiKTeX.

I would also add that you can create beautiful reports with Beamer; creating a normal report is just as simple. I've included an example below that pulls data from Yahoo! and creates a chart and a table (using quantmod). You can build this report like so:

Sweave(file = "test.Rnw")

Here's the Beamer document itself:

\documentclass[compress]{beamer}
\usepackage{Sweave}
\usetheme{PaloAlto} 
\begin{document}

\title{test report}
\author{john doe}
\date{September 3, 2009} 

\maketitle

\begin{frame}[fragile]\frametitle{Page 1: chart}

<<echo=FALSE,fig=TRUE,height=4, width=7>>=
library(quantmod)
getSymbols("PFE", from="2009-06-01")
chartSeries(PFE)
@

\end{frame}


\begin{frame}[fragile]\frametitle{Page 2: table}

<<echo=FALSE,results=tex>>=
library(xtable)
xtable(PFE[1:10,1:4], caption = "PFE")
@

\end{frame}

\end{document}
Shane
Don't believe that a Sweave report is reproducible until you test it on a clean machine. It's easy to have implicit external dependencies.
John D. Cook
+21  A: 

If you'd like to see some examples, I have a few small (and not so small) data cleaning and analysis projects available online. In most, you'll find a script to download the data, one to clean it up, and a few to do exploration and analysis.

Recently I have started numbering the scripts, so it's completely obvious in which order they should be run. (If I'm feeling really fancy, I'll sometimes make it so that the exploration script calls the cleaning script, which in turn calls the download script, each doing the minimal work necessary -- usually by checking for the presence of output files with file.exists. Most of the time, though, this seems like overkill.)
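A minimal sketch of that chained, do-only-what's-needed pattern, with purely illustrative script and file names:

# Hypothetical contents of an exploration script: it sources the cleaning
# script only if the cached clean data is missing; the cleaning script can
# apply the same check before sourcing the download script.
if (!file.exists("data/census-clean.csv")) {
  source("2-clean.r")   # which itself checks for data/census-raw.csv
}
census <- read.csv("data/census-clean.csv")
plot(census$year, census$population, type = "l")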

I use git for all my projects (a source code management system), so it's easy to collaborate with others, see what is changing, and easily roll back to previous versions.

If I do a formal report, I usually keep R and LaTeX separate, but I always make sure that I can source my R code to produce all the code and output that I need for the report. For the sorts of reports that I do, I find this easier and cleaner than working with LaTeX.

hadley
I commented about Makefiles above, but you might want to look into them -- it is the traditional dependency checking language. Also -- I am going to try to learn ggplot2 -- looks great!
ws
I like the idea of having a way to specify dependencies between files, but having to learn m4 is a big turn-off. I wish there was something like rake written in R.
hadley
You don't need m4 for make, unless you need to do crazy things. ...
ws
+1  A: 

At a more "meta" level, you might be interested in the CRISP-DM process model.

Jouni K. Seppänen
A: 

For writing a quick preliminary report or email to a colleague, I find that it can be very efficient to copy-and-paste plots into MS Word, an email, or a wiki page -- often best as a bitmapped screenshot (e.g. on a Mac, Apple-Shift-(Ctrl)-4). I think this is an underrated technique.

For a more final report, writing R functions to easily regenerate all the plots (as files) is very important. It does take more time to code this up.
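A minimal sketch of that kind of regeneration function, with hypothetical plot helpers, column names, and output paths:

# Hypothetical: one function per figure, plus a driver that rewrites the
# whole figure directory so the full set can be rebuilt with a single call.
plot_population_trend <- function(d) plot(d$year, d$pop, type = "l")
plot_permits_by_month <- function(d) barplot(tapply(d$permits, d$month, sum))

make_all_figures <- function(d, dir = "figures") {
  dir.create(dir, showWarnings = FALSE)
  png(file.path(dir, "population_trend.png"), width = 800, height = 500)
  plot_population_trend(d)
  dev.off()
  png(file.path(dir, "permits_by_month.png"), width = 800, height = 500)
  plot_permits_by_month(d)
  dev.off()
}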

On the larger workflow issues, I like Hadley's answer on enumerating the code/data files for the cleaning and analysis flow. All of my data analysis projects have a similar structure.

Brendan OConnor
+1  A: 

I'll add my voice for Sweave. For complicated, multi-step analyses you can use a makefile to specify the different parts. This can prevent having to repeat the whole analysis if just one part has changed.

PaulHurleyuk
+31  A: 

I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.

The main motivation for this setup is for working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.
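A skeletal sketch of how do.R might tie the pieces together (the file, cache, and function names here are all hypothetical):

# do.R -- hypothetical skeleton.  load.R and clean.R are only re-run when
# the cached workspace is missing, so edits to func.R or do.R don't force
# a slow reload of the raw data.
if (!file.exists("cache/clean.RData")) {
  source("load.R")    # read raw data, save() it to cache/raw.RData
  source("clean.R")   # fix missing values/outliers, save() cache/clean.RData
}
load("cache/clean.RData")
source("func.R")      # function definitions only -- no side effects

results <- run_analysis(clean_data)   # run_analysis() defined in func.R
print(summary_table(results))         # summary_table() defined in func.R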

Josh Reich
That's a really good workflow. I have struggled with designing a workflow and when I ask those around me they generally respond, "what? workflow? huh?" So I take it they don't think about this much. I'm going to adopt the Reichian LCFD model.
JD Long
This is pretty close to my workflow; I often have an import script, an analysis script, and a reporting script.
kpierce8
LCFD: Least Commonly Fouled-up Data
William Doane
+5  A: 

I just wanted to add, in case anyone missed it, that there's a great post on the learnr blog about creating repetitive reports with Jeffrey Horner's brew package. Matt and Kevin both mentioned brew above. I haven't actually used it much myself.

The post follows a nice workflow, so it's well worth a read:

  1. Prepare the data.
  2. Prepare the report template.
  3. Produce the report.

Actually producing the report once the first two steps are complete is very simple:

library(tools)
library(brew)
brew("population.brew", "population.tex")
texi2dvi("population.tex", pdf = TRUE)
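For context on step 2, brew templates mix LaTeX with R in ERB-style delimiters (<% %> runs code, <%= %> inserts a value); a minimal, purely hypothetical population.brew might look like this:

\documentclass{article}
\begin{document}
% population.csv and its count column are hypothetical stand-ins
<% pop <- read.csv("population.csv") %>
The estimated district population is <%= format(sum(pop$count), big.mark = ",") %>.
\end{document}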
Shane
In fixing a small grammatical error I messed up the wordpress.com addressing. So the correct link is http://learnr.wordpress.com/2009/09/09/brew-creating-repetitive-reports/
learnr
A: 

"make" is great because (1) you can use it for all your work in any language (unlike, say, Sweave and Brew), (2) it is very powerful (enough to build all the software on your machine), and (3) it avoids repeating work. This last point is important to me because a lot of the work is slow; when I latex a file, I like to see the result in a few seconds, not the hour it would take to recreate the figures.

dan