views: 358
answers: 3

Statistical analysis/programming is writing code. Whether the analysis is descriptive or inferential, you write code to import the data, clean it, analyse it, and compile a report.

Analyzing the data can involve many twists and turns of statistical procedures, and many angles from which you look at your data. At the end, you have many files, with many lines of code, performing tasks on your data. Some of that code is reusable, so you encapsulate it as a "good to have" function.
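To make that concrete, here is a minimal sketch of such a workflow in R (the file name, column names, and model are invented purely for illustration):

    # a reusable, "good to have" helper extracted from earlier analyses
    standardise <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

    # import
    survey <- read.csv("survey_2010.csv")

    # clean
    survey <- subset(survey, !is.na(income))
    survey$income_z <- standardise(survey$income)

    # analyse
    fit <- lm(income_z ~ age + education, data = survey)
    summary(fit)

    # report: write the coefficient table out for the write-up
    write.csv(coef(summary(fit)), "income_model.csv")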

This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.

From the Wikipedia article on Software development:

The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.

According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.

Which leads me to my question: what differences can you outline between the two activities?

It can be in terms of the technical aspects, the different strategies or work styles, and whatever else you think is relevant.

This question came to me from the following threads:

+1  A: 

If you are using R, then you'll likely be writing code to solve your statistical questions, so in this sense, statistical analysis is a subset of programming.

On the other hand, there are plenty of SPSS users who have never ventured beyond a bit of pointing and clicking to solve their stats problems. This feels less like programming to me.

Richie Cotton
The software is merely the tool, like a pencil, paper and an eraser. Understanding is paramount. Code is not the output of analysis; the conclusions of the analysis are. Software/code is used to administer the steps, but theoretical understanding, including implicit and explicit understanding of all aspects of the analysis, is necessary. Computers are important, but think of performing a statistical task by hand and then the programming argument dries up.
Jay
+6  A: 

As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.

A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.

Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:

  • Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
  • A more routine kind of analysis (e.g. an annual government survey such as the census) could follow a more traditional methodology such as the waterfall approach, since it would follow a very clear set of steps that are mostly known in advance.

I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.

Statistical analysis in some sense demands even more rigor around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance, so long as the program functions the way the specifications require. With scientific and statistical computing, on the other hand, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.
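As a minimal sketch of what that looks like in practice (the data file and variables here are invented), a Sweave document interleaves the LaTeX write-up with the R code that produces the numbers:

    \documentclass{article}
    \begin{document}
    \section*{Treatment effect}

    <<ttest, echo=TRUE>>=
    dat <- read.csv("trial.csv")              # hypothetical input file
    t.test(response ~ group, data = dat)      # test recomputed at every build
    @

    Every number in the compiled report comes from the chunk above,
    so the analysis and its write-up cannot drift apart.
    \end{document}

Running Sweave("report.Rnw") in R and then pdflatex on the resulting .tex file regenerates the whole report from the raw data, which is exactly the kind of reproducibility Chambers argues for.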

Shane
Many thanks for your reply Shane, I hope to see more like it in scope (though I somewhat doubt it :) ). Cheers, Tal
Tal Galili
Software development has helped to develop many of these methodologies, but this is really just good project or workflow management, right? This applies to all fields of work.
Jay
Absolutely. That isn't to say that there aren't different *kinds* of workflows which are more or less suited to different tasks. It's good to be aware of the differences and strengths/weaknesses.
Shane
100%: there are an infinite number of variations with some generalizable patterns. Certainly very beneficial! Thanks Shane.
Jay
+1  A: 

Perhaps the common denominator is "problem solving."

Beyond that, I doubt I could provide any insight, but I can at least offer a limited answer from personal experience.

This issue arises for us in hiring: do we hire a programmer and teach them statistics, or do we hire a statistics person and teach them to program? Ideally we could find someone fluent in both disciplines, and indeed that's the third net we cast, but rarely with any success.

Here's an example. The most stable distinction between the two activities (software dev & statistical analysis) is probably their respective outputs, or project deliverables. For instance, in my group someone is conducting the statistical analysis of the results of our split-path and factorial experiments (e.g., deciding from the t-test results whether the difference is significant, or whether the test ought to continue). That analysis will be sent to the marketing department, which will use it to modify the web pages comprising the site with a view towards improving conversion. A second task involves the abstraction and partial automation of those analyses so the results can be processed in near-real time.

For the first task, we'll assign a statistician; for the second, a programmer. The business problem we are trying to solve is the same for both tasks, yet for the first the crux is statistics, while for the second the statistics problems have been largely solved and the crux is a core programming task (I/O).
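To make the first (statistician's) task concrete, a sketch of that kind of analysis in R might look like the following; the conversion data are invented and the 0.05 threshold is just the usual convention:

    # hypothetical per-visitor conversion indicators (1 = converted) for two page variants
    variant_a <- rbinom(5000, 1, 0.041)
    variant_b <- rbinom(5000, 1, 0.047)

    result <- t.test(variant_a, variant_b)

    # the deliverable is the conclusion, not the code
    if (result$p.value < 0.05) {
      message("difference is significant; report the winning variant to marketing")
    } else {
      message("not significant yet; let the test keep running")
    }

The second (programmer's) task would wrap something like this in a function and hook it up to the live data feed, which is mostly an I/O problem rather than a statistical one.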

Notice also how the tools associated with the two activities have evolved so that the distinction between them (software dev & data analysis) is further blurred: mainstream development languages are being adapted for use as domain-specific analytical tools, while at the same time frameworks continue to be developed that enable non-developers to quickly build lightweight, task-oriented applications in DSLs.

For instance, Python, a general-purpose development language, has R bindings (RPy2) which, along with its native interactive interpreter (IDLE), substantially facilitate Python's use in statistical analysis. At the same time, there is a clear trend in R package development toward (web) application development: R bindings for Qt, gWidgetsWWW, and RApache are all R packages directed at client or web app development, and all had their initial release (I think) within the past 18 months. Aside from that, since at least the last quarter of last year, I've noticed an accelerating frequency of blog posts, presentations, etc. on the subject of web app development in R.

Finally, I wonder if your question is perhaps evidence of the growing popularity of R. Here's what I mean. A decade ago, when my employer purchased a site license, I began learning and using one of the major statistical computing products (no point here in saying which one, it begins with "S"). I found it unnatural and inflexible. Unlike Perl (which I was using at the time), this tool was not an extension of my brain (which isn't an optional attribute of an analytical tool; to me it's more or less the definition of one). Interacting with this system was more like using a vending machine: I selected some statistical function I wanted and then waited for the "output", which was often an impressive set of high-impact, full-color charts and tables. Nearly always, though, what I wanted was to modify my input or use that output for the next analytical step. That seemed to require another, separate trip to the vending machine. The fact that this tool was context-aware (i.e., it knew statistics) while Perl didn't, did not compensate for the awkward interaction. Statistical analysis done this way would never be confused with software development. (Again, this is just a summary of my own experience, and I don't claim it can be abstracted. It's also not a polemic against any (or all) commercial data analysis platforms: millions use them and they've earned zillions for the people who created them, so let's assume it was my own limitations that caused the failure to bond.)

I had never heard of R until about 18 months ago, and I only discovered it while scanning PyPI (the web interface to Python's external package repository) for statistics libraries for Python. There I came across RPy, which seemed brilliant but required a dependency called "R" (RPy, of course, is really just a set of Python bindings to R).

Perhaps R appeals to programmers and non-programmers equally; still, for a programmer/analyst this was a godsend. It hit everything on my wish list for a data analysis platform: an engine based on a full-featured, general programming language (which in this case is a proven Scheme descendant), an underlying functional paradigm, a built-in interactive interpreter, native data types built from the ground up for data analysis, and the domain knowledge baked in. Data analysis became more like coding. Life was good.
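The difference is easiest to see in a small made-up interactive session, where every result is just another object to feed into the next step (this uses the built-in mtcars data set purely for illustration):

    fit  <- lm(mpg ~ wt + hp, data = mtcars)    # fit a model
    res  <- residuals(fit)                      # pull the residuals straight out of it
    mtcars[abs(res) > 2 * sd(res), ]            # use them to flag poorly fitted cars
    fit2 <- update(fit, . ~ . + qsec)           # then refine the model in one line

No trips back to the vending machine; each step's output is ordinary data for the next one.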

doug
